CAN YOU HELP US OUT? I like your multiple videos on DPU technology. I was hoping you could help us out further using your already extensive knowledge of DPUs and other things. The question I suggest you answer is: what is the best and least expensive path an ordinary "Joe Basement" should follow to ramp up his expertise in DPUs? I suggest you create a video looking at each DPU provider to see if they have low-cost models that an ordinary Joe Basement can buy (he/she has some money but not huge amounts) to learn and become proficient with DPUs. Here is a list of questions you might want to answer in the video:
1 - What companies offer good, low-cost DPU models for low-cost learning?
2 - What hardware is required for a minimal setup to get a DPU working? I imagine here a minimum of two servers and a switch or router. What companies and models of switches/routers and cabling would be good choices?
3 - As you know, buying the hardware is just a start. What about the software for the DPU? Does the DPU come with an operating system and software? How much can that cost for each DPU provider?
4 - Finally, I would create a rough budget for the starter's kit outlined above for each DPU provider in order to compare.
Such a video would be most useful for many people out there wanting to learn DPUs as DevOps.
It is not super inexpensive since these are new and have the features of high-speed NICs. My sense is that if you want to learn DPUs, NVIDIA BlueField is probably the easiest to use right now.
What is written on the wall behind you? Looks like someone was writing words with their finger like people who put "Wash me" on dirty cars… I see HA, HEE, and ANA but I can't make out the rest.
Can't wait until you flash those A100's :DDD P.S.: Waiting for AMD to put V-Cache over 2 or more CCXs to eliminate inter-CCX communication latency. Do you think it is viable in the near future?
DDR5: want to double the speed of your RAM? Quad-channel RAM has double the throughput of dual-channel RAM, and eight-channel RAM has quadruple the throughput of dual-channel. Why would anyone want a 16-core Ryzen with dual-channel RAM and 24 CPU lanes? WRX80E and the Threadripper Pro 3955WX have eight-channel RAM and 128 Gen4 CPU lanes.
@@ServeTheHomeVideo Yes. Looking at the 16-core Ryzen vs. the 16-core TR PRO, there isn't much difference in today's price with the 3955WX at list. The mobo is expensive, but if you're after 4x the RAM channels and 5x the CPU lanes, it's cheap for server-grade connectivity. We're a long way from eight channels on desktops, but the HEDT option is here now.
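To put rough numbers on the channel-scaling point above (assuming DDR4-3200 purely for illustration; other speeds scale proportionally):
$1\ \text{channel} = 3200\ \text{MT/s} \times 8\ \text{B} = 25.6\ \text{GB/s}$, so dual channel $\approx 51.2\ \text{GB/s}$, quad channel $\approx 102.4\ \text{GB/s}$, and eight channels $\approx 204.8\ \text{GB/s}$.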
ARM and AMD seem similar at first, but they have one big difference. AMD is x86 with cache per core. x86 likes to shuffle through quite complex instruction pipelines per core, but is not so much of a team player. ARM, on the other hand, has a much simpler core; if you have enough shared memory (cache for all cores), you can split tasks into several smaller pipelines, each running on one core. So in your example, you could calculate a+b for the whole data set on one core, c+e on a second, c+d on a third, and a fourth core decides which answer is correct. While this might seem silly in this example, it can be a good feature in routing, streaming data, running simulations, and all the iterative AI stuff.
It doesn't matter if it's x86 or ARM, they all work pretty much the same. Performance-oriented ARM cores are not simple by any definition; they have the same optimizations and complexity that x86 has. As for splitting tasks into smaller pipelines - that's a general approach to multithreaded algorithms and it's completely architecture independent. If you have multiple cores, you try to split your task into independent smaller tasks, usually equal to the number of hardware threads, or at least you limit how much is executed at any one time to the number of hardware threads. Even better is to use vector instructions; you can get orders of magnitude performance increase just from that. As for unified vs. split L3 cache, for good parallel workloads it doesn't matter: each core is crunching its own set of data. As long as you don't overlap and don't have any false sharing, it should work the same.
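For anyone curious what that "split the work across hardware threads" point looks like in practice, here is a minimal C sketch (my own illustration, not from the video); the thread count and array size are arbitrary placeholders:

/* Split "c[i] = a[i] + b[i]" across worker threads, one contiguous chunk each.
 * The inner loop is simple enough that most compilers will also auto-vectorize
 * it at -O2/-O3, which is the "vector instructions" point above. */
#include <pthread.h>
#include <stddef.h>

#define N_THREADS 4            /* ideally = number of hardware threads */
#define N_ELEMS   (1 << 20)

static float a[N_ELEMS], b[N_ELEMS], c[N_ELEMS];

struct chunk { size_t begin, end; };

static void *add_chunk(void *arg)
{
    struct chunk *ch = arg;
    for (size_t i = ch->begin; i < ch->end; i++)
        c[i] = a[i] + b[i];    /* independent work: no sharing between threads */
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    struct chunk ch[N_THREADS];
    size_t per = N_ELEMS / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        ch[t].begin = (size_t)t * per;
        ch[t].end   = (t == N_THREADS - 1) ? N_ELEMS : (size_t)(t + 1) * per;
        pthread_create(&tid[t], NULL, add_chunk, &ch[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}

Each chunk is contiguous, so every thread streams through its own slice of the arrays and its own cache lines, which is exactly why this kind of split is architecture independent.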
I like how AMD supercharged the industry to move forward with chiplets and 3d stacking.
Imagine quad core was still what dual core was in 2010... Thanks AMD
And all because they were nearly bankrupt and needed a concept where most "bad silicon" is salvageable in some way or another.
They just took concepts IBM did years ago with their top-tier POWER and Z processors and brought them to the mainstream market.
They did work like crazy for a decade to give us Zen... meanwhile Intel was hiring more advertisers since they thought they would never have competition, even while AMD was telling people that Phenom was just a stepping stone to what they would release in the future... just like NVIDIA did...
"If you're watching this video, your computer is not doing anything useful right now"
😂 This goes out to all the sysadmins out there. 🎤 💧
I felt personally attacked
keep these vids coming, they are so valuable
With the M1 being pretty successful with its unified memory architecture, it could be that more special-purpose server CPUs get not only larger caches, but also in-package memory (a kind of L4 cache) in conjunction with external S/DRAM or PMEM modules.
While this would technically be better for performance, it would be awful for repairability and upgradability. Want more RAM? Replace your $3000 server CPU. RAM chip died? Replace your $3000 CPU. etc. I can tolerate everything being built-in for throwaway consumer devices like smartphones, tablets, and AirBooks (though it pains me that these items are throwaways), but I can't tolerate it for big heavy servers.
@@deusexaethera It could just be an L4 cache - look up Broadwell for desktop. Not a replacement for RAM, just another cache layer.
@@deusexaethera What about forcing the producer to make them repairable? At least repairable by the producer: open the package, test, unsolder and replace the defective component, close the package. Otherwise it's not allowed to be sold. End of story. We should not trade power efficiency for environmental efficiency.
Patrick, thanks for putting out some quality content. Love it.
Imagine AMD adding HBM next to the IO-die for "L4-cache"
I feel dirty now, time for a shower.
Yes please. The L4 cache on Broadwell-H (Crystal Well) was already awesome in workloads that use it; I can't imagine the awesomeness on Zen with faster HBM.
Imagine stacking HBM on an APU, just for the graphics cores. Not so dirty, IMO.
@@OTechnology I never got to play with those chips D:
@@SaberusTerras That would kind of defeat their purpose though...
Maybe for SFF stuff like Intel's NUC, but that's about it.
HBM would make the APUs a *lot* more expensive, at which point, a dedicated GPU would make more sense.
@@FinlayDaG33k Right now it does, but costs trend down over time. There was a time when SRAM was expensive and adding L2 cache to your motherboard was optional because of it.
The first server system I worked on had 8 Mbytes of memory. Every now and then I have to pause and contemplate the massive shift that has taken place in my lifetime. I'm glad I'm mainly retired now as the speed of change is making me dizzy!
I find myself amused by the fact I paid $750 for a 135 MB hard drive back in the day. Now I can get multiple gigabyte microSD cards for lunch money.
It is estimated that Millennials will witness the equivalent of 20,000 years of pre-Industrial technological advancement in the 80-ish years we're supposed to live. No wonder so many younger people have anxiety disorders -- nothing we learn is relevant for more than 6 months; we never get to feel like we're finally up-to-speed.
My first computer had 32MB of system ram!
So cool to see more crazy ideas being done (that we know of; I'm sure the silicon guys have some stories).
My first had 1KB. Seems that I am old.
Mine had 4MB but the game I wanted to play needed 8MB. I had to return NBA Live 95 back to the store. :(
Except that every time CPU manufacturers figure out benefits that work around VMware licensing costs, VMware changes their licensing system. It helps for 1-3 years depending on where you are in your licensing cycle then they adapt. They are like the Borg. Soon they will count X amount of processor cache as a socket or core or something...
shhhhh, don't give them ideas!
They can just go the Oracle route, charge per core it potentially could run on.
I hope ProxMox eats into VMware . . . over time.
Kindest regards, neighbours.
I've got a feeling some supercomputers' CPU nodes may be slated for later this year so they can get some flavor of Milan with 768MB of cache
NERSC Perlmutter?
@@MichaelWatersJ that’s 1 of them.
And the one I know the most about
Servers having fast innovation is the best! Stagnation leads to quality degradation, and clients will always want fast yet efficient servers!
As a PC gamer being what got me into all of this..... server hardware is cool cause it's just the PC gaming hardware of the future, and if PC building is like Legos, wait till you see how hotswap server hw is lol 👍
Edit: and plex is the gateway drug to sysadmin like games are to computer hardware
@@EdR640 love gaming on retired server hardware. to think it served years in a farm, and now here in my humble room
Let's hope this will result in cheap used parts for home labs.
Using the TSVs, AMD could stack a unified L4 cache of SRAM right on top of the I/O die and leave the extra room on the substrate for more chiplets that actually do work. I can see a lot of workloads that would benefit from that, especially with Intel and AMD finally moving to heterogeneous packaging and AMD buying the FPGA firm. Then you can design a line of FPGAs that have their own V-Cache and can be linked to the unified L4, allowing the FPGA and x86-64 chips to share data more efficiently. Plus, AMD's Infinity Fabric runs equal to the memory clock, and that clock is about to accelerate by leaps and bounds as DDR5 matures. The absolutely gigantic caches will offset the latency penalty we are looking at from the move to DDR5, and the FClock will (eventually) be running 4200-5000+ up from today's paltry 1900MHz, still at 1:1 with the MClock.
A good topic for a video might be "the future of small servers". All the changes you have described are mostly applicable to the cloud players and huge data centers. What is the future for R720 homelabbers :)
Home labs are going to be very different. As technology changes more rapidly, the value of older gear decreases much faster. Also, servers are moving to power/size levels where they will not be suitable for home labs. Much of the home lab community today is feasting on the leftovers from the Xeon E5 stagnation era. That is why we have this series: we want to show people the inflection point is coming, and soon.
@@ServeTheHomeVideo Can you expand a bit more on this perhaps? For example - do you expect home labbers to continue feasting on the old tech until these disaggregated architectures become outdated (so business as usual for the next 5 years or so), or do you expect home labs to shift dramatically after the inflection point to the disaggregated model, but scaled down to lesser needs? It would be great if you could also explain why you expect one or the other (or some third way)?
Good to see HBM finally coming to CPUs (/SoCs), been waiting for this for a few years, though I was hoping/expecting it to come with DRAM replacement for main memory, switching to non-volatile and more importantly, serial rather than parallel interconnect given how crazy DDR's become for its physical demands.
768MB now (OK, custom test silicon for now), but imagine if that 14nm I/O die had TSV cache too; the surface area of that die could allow for as much as 1GB-1.25GB right there. A Zen3++ Epyc/Threadripper could have 2GB of cache, though the I/O-die cache would be much slower because it would have to go through the PCB and would have to sacrifice bandwidth somewhere.
What is "TSV"?
@@theq4602 Through-Silicon Vias - what AMD used to mount 64MB of cache on top of each of the 5900X's chiplets (128MB of extra cache in total). They're small conductive tubes that allow stacking multiple chips on top of each other.
@@denvera1g1 thanks, guess I just didn't put the acronym together
Well, I do remember that when AMD showed off that prototype that had just doubled its cache, they noted how the interconnect between the extra layer had a much wider bandwidth. So putting DRAM there instead, to make it fast to transfer from DRAM to cache, thus allowing gigabytes in the chip package - yes, I can agree it points that way.
@ 0:53 a bit of magic, a vanishing CPU... EPYC !
I've been waiting for this for so long. We should have done this a while ago.
I often cluster threads × MHz × MB per price to get a linear metric to compare CPU models; that's pretty simplified, but within the same family it's quite helpful.
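For anyone who wants to play with that heuristic, here is a tiny C sketch of it (the part names, specs, and prices below are invented placeholders, not real listings):

#include <stdio.h>
#include <stddef.h>

struct cpu {
    const char *name;
    int threads;
    int base_mhz;
    int l3_mb;
    double price;
};

/* The metric from the comment above: threads * MHz * MB of cache per currency unit. */
static double value_score(const struct cpu *c)
{
    return (double)c->threads * c->base_mhz * c->l3_mb / c->price;
}

int main(void)
{
    struct cpu parts[] = {
        { "hypothetical 16C/32T", 32, 3400, 64, 550.0 },
        { "hypothetical 12C/24T", 24, 3700, 64, 450.0 },
    };
    for (size_t i = 0; i < sizeof parts / sizeof parts[0]; i++)
        printf("%-22s score %.0f\n", parts[i].name, value_score(&parts[i]));
    return 0;
}

As the comment says, it only really works within one family, since IPC differences between architectures are not captured at all.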
from 4:40 on I was wondering why Italy had flipped into a parallel mirror universe.
Great video. Learned a lot. Thanks.
Such interesting content delivered so clearly. Thanks
* describes extremely common program design * "now of course that is an absolutely crazy example..."
I was more worried about people getting enraged over my pseudocode syntax.
@@ServeTheHomeVideo I'll let you know then that I found writing "a + b = c" instead of "c = a + b" to be quite irksome...
Well, if they spread the vcache over all the CCXs in an Epyc, that's theoretically 75% of 1GB ;) in sheer CCX L3 caches... If they add as dense a vcache to the I/O die, they could probably place between 128MB and 256MBs for I/O die cache, totalling over 1GB in single CPU socket cache ;) And that's with this version of low-layer-count vcache. The future looks great, with the CPUs possibly maybe even getting multigig options to rival GPGPU/accelerators on package memory ;) The issue I can see here is the price bracket barrier before it trickles down to everyday consumer space ;)
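For reference, the arithmetic behind that "75% of 1GB" figure, assuming the demoed 64MB stack on top of each CCD's existing 32MB of L3 and eight CCDs per Epyc package (an assumption, not a confirmed product):
$8 \times (32 + 64)\ \text{MB} = 768\ \text{MB} = 0.75\ \text{GB}$.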
Awesome. Thank you.
Subscribed.
Thanks!
I've got nothing on the eggheads advancing modern processor design, but even I can see how much compute goes unused / to waste for (theoretically) benign reasons. An example of needlessly incurred waste related to this issue is the idea of serving/accessing storage via NFS: you leverage IP connections and traverse a number of connectivity types just to present some data to the CPU. Compared to an NVMe SSD, it's horrible. With that in mind, I love the design approach IBM has taken with their POWER10 architecture.
Then you don't understand why we need NFS. The most obvious reason - access control. NVMe has zero security in it. The NVMe-oF standard doesn't cover anything and expects people to figure it out on their own, like running it inside a TLS tunnel over TCP. But this still only gives you encryption and authentication, and nothing about ACLs and all the other things NFS can do. After all, NVMe is just a transport.
POWER10 is also a completely different use case which can't possibly cover what we use NFS for.
The inefficiency is all there and we know it. But it's there for a reason, and we can only gradually try to improve on it by chipping away at the most important bits. For example, NVMe-oF has all the potential to replace many other protocols like iSCSI and become the de facto standard way of accessing remote storage. But we still can't present it as-is: the storage solution may use NVMe-oF internally, but outside it would still have to use NFS to properly cover all the use cases. Otherwise it would be useless in pretty much all environments. Same with POWER10: it may be very efficient inside, but outside we still need NFS.
Hello patrick :D
No this is the krusty krab
What I'm still wondering is how to keep cache coherence communication overhead in check when we have so many cached pages. CXL memory will be yet another layer to keep coherent. It boggles the mind just how much more complex this is getting.
CXL memory is just a piece of memory. It covers some part of the physical address space like your average memory DIMM. The CPU will use the same TLB cache to increase performance because CXL memory will be mapped into the virtual address space like any other RAM. Also, cache coherency doesn't care about the size of caches or RAM. Cache lines correspond to specific physical RAM addresses; the CPU only needs to look at the cache line to understand where it sits in memory.
@@creker1 my question is how to update outdated cached memory in an efficient way.
@@ricardolmendes what do you mean outdated? Caches are always up to date. It's memory that we have to keep up to date with cache by regularly flushing cache lines.
@@creker1 imagine the same process running on 2 different CCDs - how do you keep a page updated if it's changed simultaneously in two different cache banks?
@@ricardolmendes that's solved by the CPU's internal cache coherency agent. There's a special protocol, MESI, that solves that. If needed, the CPU will go over Infinity Fabric to read a cache line from a different CCD. Pretty much all modern processors are fully cache coherent. After that it's a software problem to keep everything working correctly by using mutexes, atomic operations, etc. And it's a software problem to keep it fast by reducing cases where two threads fight for the same memory (that's just bad software design), or at least by keeping all threads inside a single CCD.
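A concrete example of the "two threads fight for the same memory" problem mentioned above is false sharing, and the usual fix is padding. This is a generic C sketch, not anything AMD-specific:

/* Two counters updated by two different threads. In the "bad" layout they share
 * one 64-byte cache line, so every increment forces the line to ping-pong
 * between cores/CCDs via the coherency protocol. Padding each counter out to
 * its own line (the "good" layout) removes that traffic. */
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct counters_bad {
    atomic_long a;               /* thread 0 writes this...             */
    atomic_long b;               /* ...thread 1 writes this, same line  */
};

struct counters_good {
    alignas(64) atomic_long a;   /* its own cache line */
    alignas(64) atomic_long b;   /* its own cache line */
};

int main(void)
{
    printf("bad:  a and b are %zu bytes apart\n",
           offsetof(struct counters_bad, b) - offsetof(struct counters_bad, a));
    printf("good: a and b are %zu bytes apart\n",
           offsetof(struct counters_good, b) - offsetof(struct counters_good, a));
    return 0;
}

The data never logically overlaps - each thread has "its own" counter - yet in the bad layout the coherency protocol still has to shuttle the line back and forth, which is exactly the kind of traffic that gets more expensive once it has to cross CCDs.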
I thought I was a computer nerd until I discovered ServeTheHome. At least on the hardware end of things, I don't care nearly as much about the details as this guy does. Need me to write a n-branching-tree routing algorithm? I'm your guy. Need me to explain exactly how various levels of cache improve performance? Don't care, just let me know when the product is ready for deployment.
Would be pretty interesting if every single chiplet had a CPU+GPU combo on it. Great video Patrick.
My only question on the 3D cache is why use Silicon for the structural components? I would think that copper would be a much better option, given that you would be conducting heat away faster. Is the thermal expansion difference too much at that scale?
Thermal expansion rates are a big deal with this stuff.
Back on Haswell/Broadwell, some Intel CPUs had 128MB of L4 Cache. Any chance they'll go back to having L4 on either consumer or server parts?
Do you think they will ever bring back big consoles for controlling networks? Would be cool if you could have a big Star Trek looking control surface lol.
You can make a big console to control anything if you want.
@@deusexaethera yeah I get that but I just never see them. Sometimes you will see them in scientific or surveillance settings. Honestly I don’t even remember why I made this comment though haha.
In a year we'll have 60 million gamers in mental institutions. When game engines are updated to allow use of more cores and fast access to high-speed storage, they will realize that they've been conned by YouTube review channels and that 4-core, 1080p-focused builds have sucked for a decade.
Could have mentioned FUJITSU A64FX with its 32GB HBM2 memory
Totally. That was in the Arm + HBM section but it apparently got cut out by the editing folks. Oh well.
Intel's Knights Landing also comes to mind, though that MIC series of processors got killed off shortly afterwards
it's mindblowing how fast the industry is moving right now. a good duopoly is nonstop at war, and both Intel and AMD are feeling the ever-increasing threat from ARM-based servers
what if they stacked extra ram on top of the IO die?
Giant cache doesn't work because the time to determine a cache hit/miss, then issuing the memory access on a miss negates the benefit of cache beyond some point. The better solution is probably to implement two-tier memory with a small fast memory being on the CPU, and current DRAM-type memory outside the CPU.
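The trade-off being described here is essentially the classic average memory access time estimate:
$\text{AMAT} \approx t_{\text{hit}} + m \cdot t_{\text{miss penalty}}$
where $m$ is the miss rate. A bigger cache lowers $m$ but tends to raise $t_{\text{hit}}$ (more tags to check, longer wires), which is why there is a point of diminishing returns.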
Using HBM for a L4 cache does present a new problem that needs to be solved, which is increased power usage. That's something on the order of 5W per chip that you can't put into the CPU cores in a power or thermal-constrained design.
More standardized exotic cooling, especially for very dense setups, might emerge as a necessary corollary. You can't just slap a heat sink onto a 64-core CPU with 8GB of HBM and depend on even the absurdly loud fans of a 1U case to keep it cool.
I remember optimizing latency by replacing the cabling that ran from a CPU cabinet (large-fridge-sized), under the floor, to the memory cabinet - by bolting the memory cabinet to the side of the CPU cabinet instead.
Many tasks only need one or two gigabytes of memory already anyway, including some CPU intensive tasks. When CPUs get up to 8GB cache, this question will become even more relevant.
Do you know of any projects or pushes or plans to allow servers (and eventually desktops) to boot and run a suitable OS without needing any external DIMMs installed? Would be huge potential for cost and power consumption at the low end, for things like dedicated appliances, low end desktops, ultra portable laptops, and of course, having all that cache will already be providing the benefits at the high end
My server architecture is already based on 2 GB and a Pentium 4 HT with 2 IDE HDDs and 2 laptop HDDs on SATA-1 in total a whopping 1.21 TB :)
It runs FreeBSD 13.0 on OpenZFS 2.0 with 2 external connections 1 Gbps Ethernet and Power.
ZFS with 2 GB of RAM? That's kind of hilarious... But I've done it as well and yeah, it works, kinda at least.
Also it's nice to see other people using FreeBSD, it makes me feel not so alien anymore.
@@nilswegner2881 On 64-bit systems I use Ubuntu, but I needed a reliable 32-bit system using ZFS, so I have used FreeBSD since June 2019. I use it with XFCE, XRDP and Conky and it works fine. I even sometimes use Firefox 89. In general I use it for ~1 hour/week for my weekly backup with "send | ssh receive". It takes ~1 hour, because it runs at 200 Mbps due to a 95% CPU load on one of the CPU threads.
Why just stop at GB? You could say it has about 0.00025 terabyte cache.
Installing Windows 95 in the CPU cache 😎
is that licensing-cost-per-core thing just a theoretical exercise or can you provide real examples of multi-thousand $ software licenses that get much more expensive with more cores?
SQL Server, Oracle, ANSYS LS-Dyna, VMware, Windows Server, too many to list. Feel free to start with SQL Server and go on from there. Remember that if you are using VMware/ Windows Server for virtualization you pay for core license packs per server, then you may have to pay software package license costs for applications running in the virtualized environments. So you can get hit by multiple levels of this. Often license costs greatly outweigh the server hardware.
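A back-of-envelope example of why licensing dominates (the per-core figure is purely hypothetical, not a quote for any product named above):
$64\ \text{cores} \times \$7{,}000\ \text{per core} = \$448{,}000$ in licenses, against a server that might cost \$20,000-\$30,000. That is why per-core performance, and features like big caches that raise it, matter so much: halving the licensed core count for the same throughput pays for the hardware many times over.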
@@ServeTheHomeVideo If costs are so significant, would it not make more sense for companies to move away from such platforms? I understand that you can't speak for the highly specialized situations in which companies may find themselves. I am approaching this as an outsider; the companies I've worked for do extensively use Windows Server and Citrix, though I've never really seen the reason why, as everything we did could just as well have been done on Linux.
Questions; Between Large L3 on v4 ring bus to smaller cache making room for Scalable Cores on mesh topology 1) did mesh actually overcome bus saturation; XCC, I viewed Scalable as cache starved and 2) how did Optane intermediate memory add, ease or deter access on the CPU bus side for real work loads not on the storage side I get that. 3) On Epyc side 'F' derivative LC in particular, are there tools that enable partitioning to characterize the L3 for coding specific complimentary in memory compute functions? mb
On the Epyc F inquiry can you characterize like a GPU array the processing cores to L3 memory partitions for parallel relational data processing in L3?
@@mikebruzzone9570 sorry, I have a hard time understanding your questions. Can you rephrase and elaborate a bit on them? At least 2 and 3. I understand the first one but don't have anything to say; someone with benchmarking experience or experience profiling these CPUs might be able to answer it.
@@creker1 2) Does Optane DIMM work as persistent intermediate memory tier large addressable space. At what addressable space base DRAM 1.5 TiB v Optane 'M' 2 TiB and 'L' 4.5 TiB. What trade off or compliment between or with PiB flash arrays an added inquiry.
I see in my channel research Cascade Lake M and L do sell in server, not in volume but they are seemingly addressing a need, database I suspect or large simulation including 2P workstation and/or Virtual Desktop workstation server, but anything above DRAM base 1.5 TiB for a single socket seen at Xeon W 32xxM sits in the channel and does not sell. 32xx sells as long as its 1.5 TiB base; DRAM, who mingles in Optane [?] where one step up 32xxM does not sell and sits in product for sale inventories. Why is this?
Is Optane still too pricey for its latency vs. DRAM? Can't be economically cooled? Application optimization required for Optane's maximum addressable space? Optane seems to work for something - what is it?
Notation, I have no data yet for Ice lake 6 TiB Optane.
3) My Epyc 'F' and Epyc thesis generally is AMD has a tool to characterize Epyc set of cores address to defined memory partitions in L3 for in memory compute of probably a related whole computation? Parallel work loads perhaps for simulations? Weather for example. Any thoughts?
mb
@@mikebruzzone9570 2) I think Optane is a fairly small market as is, at least compared to RAM and flash. Not many workloads actually need it to justify the cost. The main usage I've seen is as an additional cache tier. For storage systems, Optane works as a fast and reliable write cache: all writes go to Optane and flush to SSD/HDD in the background. For databases it works as an additional tier for hot data; it allows faster restart, faster processing, lower latency. But in both of these cases main memory is still a separate thing used for actual processing of data by the CPU - Optane is slower than RAM. In both of these cases Optane is not being used as RAM (Memory Mode) but as a storage medium (App Direct). I don't think using it as actual RAM is that popular. As an engineer I don't see much use for Memory Mode - you don't control what goes where, everything is hidden from you, and in any high-performance application that's the last thing you want. But that's just my opinion. Patrick probably has real data as to which use cases are actually popular in the wild.
3) I still don't quite understand your question. Are you asking whether AMD can configure its F SKU, which has a small number of cores but a large L3, so that the L3 acts as RAM and all processing happens in L3? They probably can, but I don't think anyone would do that. Using CPU cache as RAM is a very non-trivial mode to run in that's used during the very early parts of the boot process before memory is even initialised. Even the bootloader runs as little code as possible in that mode, just to be able to get out of it.
@@creker1 Thank you for your thoughts and observations on my inquiries. mb
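To make the App Direct vs. Memory Mode distinction above a bit more concrete, here is a minimal C sketch of the App Direct style of access: the persistent region is exposed as a file on a DAX-capable filesystem and the application maps it and uses plain loads and stores. The mount path and size are hypothetical, and real code would typically use MAP_SYNC or PMDK for proper persistence guarantees; msync() here is the portable stand-in:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;                      /* 1 MiB region, arbitrary */
    int fd = open("/mnt/pmem0/example.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, (off_t)len) != 0) return 1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    strcpy(p, "hello, persistent world");            /* plain store, no read()/write() path */
    msync(p, len, MS_SYNC);                          /* flush to the persistent media */

    munmap(p, len);
    close(fd);
    return 0;
}

The point of App Direct is exactly this: the application decides what lives in the persistent tier, instead of the hardware transparently deciding as in Memory Mode.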
With so much cache, could it be possible to modify the Linux kernel to use the cache as ram and have no ram in the system? 256MB is already enough for very small lightweight Linux distros. Probably not practical but it would be hella fast!
Sadly, cache is not an addressable memory. The CPU automatically caches the important stuff from RAM.
So it depends a bit on the way it is implemented. The old Xeon Phi x200/ x205 HPC chips could use the memory packaged alongside the compute die either as a cache for main memory or as addressable memory. You could actually run those systems without any DRAM installed in sockets.
@@ServeTheHomeVideo Wow, that's cool.
@@tunahankaratay1523 so what you mean is that there are no opcodes meant for reading modifying or writing to cache directly? As in some autonomous coprocessor handles the caching and the user program can only interact with pages of ram?
All x86 processors are actually doing that already. When a PC boots, the BIOS doesn't have access to RAM because the memory modules are not initialized. So the BIOS runs in a special mode where it can use the CPU cache as RAM. Maybe with some effort we could reuse that mode to allow a full Linux kernel to run like that; I don't really know.
Maybe an L4 cache - a big cache shared by all cores.
Or maybe RAM modules could get a cache layer of their own further out?
You didn't even need to go to such extremes to describe how important caches are. Even going to main memory is insanely slow by CPU standards and will destroy the performance of any CPU-intensive workload. Games in particular go to great lengths to optimize cache utilization by architecting the whole engine around it.
Totally, but we got a lot of questions around latency with CXL so I wanted to discuss that. Also, went to the extreme just to help even less technical folks understand what is going on. Many more people understand latency to remote game servers and databases/ websites than understand memory tier latencies.
creker, can you and/or Patrick address my question in the comments - Xeon v4 to Scalable bus saturation at XCC in particular, the Optane DIMM workaround, and the Epyc F-designation tool question? Thanks. Mike
Yep. Even as a semi-amateur HPC developer I've encountered situations where it's "cheaper" - that is, faster - to regenerate data on-demand than to store and load it.
I decided I needed to work on something else when I found my optimizations focusing on register spills to _cache._
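A tiny example of the kind of cache-driven restructuring game engines do, as mentioned a few comments up (a generic sketch, not any specific engine): keeping the hot fields contiguous so a pass over them touches far fewer cache lines.

/* Array-of-structs: updating positions drags every entity's cold data through
 * the cache. Struct-of-arrays keeps the hot fields contiguous, so the same loop
 * touches a fraction of the cache lines and prefetches/vectorizes cleanly. */
#include <stddef.h>

#define N 100000

/* AoS layout: position data is interleaved with rarely touched fields. */
struct entity {
    float x, y, z;
    float vx, vy, vz;
    char  name[64];      /* cold data pulled in alongside the hot fields */
    int   flags;
};
static struct entity world_aos[N];

/* SoA layout: each hot field is its own dense array. */
struct world_soa {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
};
static struct world_soa world;

static void step_aos(float dt) {
    for (size_t i = 0; i < N; i++) {
        world_aos[i].x += world_aos[i].vx * dt;
        world_aos[i].y += world_aos[i].vy * dt;
        world_aos[i].z += world_aos[i].vz * dt;
    }
}

static void step_soa(float dt) {
    for (size_t i = 0; i < N; i++) {   /* contiguous accesses, easily vectorized */
        world.x[i] += world.vx[i] * dt;
        world.y[i] += world.vy[i] * dt;
        world.z[i] += world.vz[i] * dt;
    }
}

int main(void) { step_aos(0.016f); step_soa(0.016f); return 0; }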
16:14 two 64-gig chips, yes.
I know this is stupid, but if they could make a 1-core, 2-thread CCD with just the one core and the rest of the silicon filled with SRAM L3 cache, that would be a lot of cache.
Does it bother anyone else that the video at 5:02 is flipped?
It has to be! Talking about the return trip of data at that point.
@@ServeTheHomeVideo Ah! Had me freaked out for a minute, thinking the whole world was inside out.
Per-core SW licenses are painful, I agree.
What about . . .
Cooling requirement per m^2 per IOP/frame on a GBcache-coherent basis?
Cooling requirement per m^2 per IOP/frame on a HBM-coherent basis?
Cooling requirement per m^2 per IOP/frame on a CXL-coherent basis?
Cooling requirement per m^2 per IOP/frame on a DPU-coherent basis?
Cooling requirement per m^2 per IOP/frame on a mixed and custom-tuned basis?
How do the above benchmarks improve or underperform versus . . .
Today's cooling requirement per m^2 per IOP/frame?
Kindest regards, neighbours and friends.
how much longer till we have a good cpu with hbm and a decent gpu all in one package ?
The challenge with HBM is largely cost. As a result, you are more likely to see HBM paired with higher-end parts, not just the "good/ decent" elements unless they are custom designs.
@@ServeTheHomeVideo a guy can dream xd. Do you see HBM ever scaling up to a point where it's around the same price as GDDR?
Fujitsu A64FX?
1:54 Well, the Snapdragon 855+ has to work quite a bit to downscale the 2160p data. Not sure how it works, but my phone gets hot
Are we going to see the move toward say 8GB of cache & ditch the need for ram modules for mobile applications?
This is why AMD's big 3d cache is so interesting to me!
I'm not sure i follow this lego analogy. Does that mean that AMD chips will now hurt more when you step on them? Will they have teeth marks?
Ha! I brought the Legos back to my neighbors and said "here you go... in case you did not have enough to step on already."
They'll only have teeth marks if Ian Cuttress gets a hold of them.
Does make me wonder why we talk about CPU frequencies in GHz but RAM frequencies in MHz, despite them having become very comparable for a while now...
That CPU has the same amount of cache as my first graphics card had VRAM.
The ideal system design is one where the CPU and main memory run at the same speed. Imagine the performance of something like that! Currently, CPUs run much faster than main memory. I think some effort should be made on creating faster main memory. The cache approach is good, but it's not a virtuous cycle; faster main memory is. Perhaps an innovative firm will create a system with very fast main memory, a modest cache size, and standard processors, offering tremendous performance at a good price.
Love to see CPU cache in GBs
8:36 Next gen Mac Pro expandable memory
My first computer had 512MB of DRAM. Soon regular computers might have more SRAM than that in the socket. Sure, the IBM z15 has had 960MB for a while, but it doesn't count because mainframes are a niche market.
Shouldn't it be c = a+b? a+b=c would produce an ASSERT error? (most of the time)
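For anyone following along, in C-family languages the assignment target does have to be on the left, so the corrected form would look like this tiny hypothetical snippet (the reversed form is rejected by the compiler rather than triggering an assert):

```cpp
#include <cstdio>

int main() {
    int a = 2, b = 3;
    int c = a + b;   // assignment: the result of a + b is stored in c
    // a + b = c;    // does not compile: "a + b" is not an assignable lvalue
    std::printf("%d\n", c);
}
```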
Just FYI... the upcoming Intel Xeon Sapphire Rapids server CPU has 64 GB of HBM on the same package.
Intel has not officially released this detail yet so we are only covering what is public in this. Of course, adding 64GB of HBM would push a CPU to have GB onboard not just MB.
Memory is the new disk
A disk is a form of computer memory.
Using stacked SRAM to increase cache is just too costly: even if you double the production cost and sacrifice the core architecture, the hit rate will only increase from 90% to something like 92-93%. Personally I think this is stupid. A large LLC should be made with something like HBM, which is only slightly slower than the current L3 (which is 40-60 cycles) but way larger (at least 8GB per package) and cheaper. SRAM should stick to the tiers that are at least an order of magnitude faster than DRAM, rather than building a whole hard disk's worth of SRAM, squeezing it into the CPU, and taking 3/4 of its area or thermal budget without making much difference.
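For context on what a few points of hit rate are worth, here is a back-of-the-envelope average-memory-access-time sketch; the cycle counts are assumptions for illustration, not measured numbers from any real part.

```cpp
#include <cstdio>

// Average memory access time = hit_time + miss_rate * miss_penalty.
double amat(double hit_rate, double hit_cycles, double miss_cycles) {
    return hit_cycles + (1.0 - hit_rate) * miss_cycles;
}

int main() {
    // Assumed latencies: ~50 cycles for a last-level cache hit,
    // ~300 cycles for a miss that goes out to DRAM.
    std::printf("90%% hit rate: %.0f cycles avg\n", amat(0.90, 50.0, 300.0));  // ~80
    std::printf("93%% hit rate: %.0f cycles avg\n", amat(0.93, 50.0, 300.0));  // ~71
    // Going from 90% to 93% cuts misses from 10% to 7%, roughly a 30%
    // reduction in DRAM traffic, so the gain is larger than "2-3 points" sounds.
}
```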
Mentioned this a bit in the video, but HBM is coming to CPUs.
Except HBM is much slower; its latency is even higher than DDR4's.
AMD showed a 15% increase just from adding cache on a non-optimized, probably GPU-bound workload (Gears 5). That's big. The last time they got an uplift like that was going from Zen 2 to Zen 3 with tons of microarchitectural changes. This is just cache slapped on a die. When this goes into production and people actually start using it to optimize software, they will get even bigger gains.
@@creker1 Everyone is already trying to use the cache; it's a basic part of code optimization. In the end it all comes down to hit rate, and to balancing things to make the most of space and power.
HBM is a large category with many types, each with its own design emphasis. Sometimes the eDRAM in Broadwell or the MCDRAM in Xeon Phi are also called HBM, and they both have lower latency than DDR memory. Just to say that you don't always need SRAM to make an effective cache.
@@ServeTheHomeVideo Saw it ;)
@@Alan_Skywalker People don't expect big caches, so they try to keep data small, do tricks to hide latency due to cache misses, etc. With a big cache the working set size can be much bigger, and that will definitely make an impact on software architecture.
All of those cases are still just DRAM and will have comparable or higher latency than main RAM. SRAM cache will always be significantly faster.
The eDRAM in Broadwell was only needed because of the integrated GPU. GPUs need bandwidth; they don't care much about latency. And in that case the eDRAM was acting as a sort of L4 cache, not what AMD is doing here by increasing the L3 cache.
MCDRAM on Phi is just RAM, nothing fancy about it except that it's closer and has more bandwidth than regular DDR.
Yes, you can get away without SRAM but then it's a lower tier cache and will always come after L1, L2, L3.
CAN YOU HELP US OUT?
I like your multiple videos on DPU technology. I was hoping that you could help us out further using your already extensive knowledge of DPUs and other things. The question I suggest you answer is: what is the best and least expensive path for an ordinary "Joe Basement" to ramp up his expertise in DPUs? I suggest you create a video looking at each DPU provider to see if they have low-cost models that an ordinary Joe Basement can buy (he/she has some money, but not huge amounts) to learn and become proficient with DPUs.
Here is a list of questions you might want to answer in the video
1 - What companies offer good, low-cost DPU models for low-cost learning?
2 - What hardware is required for a minimal setup to get a DPU working? I imagine a minimum of two servers and a switch or router. What companies, models of switches/routers, and cabling would be good choices?
3 - As you know, buying the hardware is just a start. What about the software for the DPU? Does the DPU come with an operating system and software? How much can that cost for each DPU provider?
4 - Finally, I would create a rough budget for the starter's kit outlined above for each DPU provider in order to compare.
Such a video would be most useful for the many people out there wanting to learn DPUs for DevOps.
It is not super inexpensive since these are new and have the features of high-speed NICs. My sense is that if you want to learn DPUs, NVIDIA BlueField is probably the easiest to use right now.
But still, you, or rather the CPU, would have to wait for the cache to fill up anyway, so what's the difference?
Damn, that's like millions of lines of a database in the CPU, like holy shit
The lego analogy is confusing, can you explain it in tacos?
Tiered memory is the obvious next step, no mention yet.
What is written on the wall behind you? Looks like someone was writing words with their finger like people who put "Wash me" on dirty cars… I see HA, HEE, and ANA but I can't make out the rest.
Couldn't wait you'll flash that A100's :DDD
P.S.: Waiting for AMD to put V-Cache over two or more CCXs to eliminate inter-CCX communication latency. Do you think that is viable in the near future?
Or build a CPU without cache, with more cores at higher GHz.
In the workloads this is designed for, more cores and more GHz will not help much
I'm still waiting for 32GB of HBM2e in the CPU.
DDR5: Want to double the speed of your RAM? Quad-channel RAM has double the throughput of dual-channel RAM, and 8-channel RAM has quadruple the throughput of dual-channel. Why would anyone want a 16-core Ryzen with dual-channel RAM and 24 CPU lanes? WRX80E and the Threadripper Pro 3955WX have 8-channel RAM and 128 Gen4 CPU lanes.
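The rough arithmetic behind that scaling claim: each DDR channel is 64 bits (8 bytes) wide, so theoretical peak bandwidth grows linearly with channel count at a given transfer rate. A small sketch with assumed DDR4-3200 numbers (real-world throughput is lower):

```cpp
#include <cstdio>

// Theoretical peak = channels * mega-transfers/s * 8 bytes per transfer.
double peak_gb_per_s(int channels, double mega_transfers) {
    return channels * mega_transfers * 8.0 / 1000.0;
}

int main() {
    std::printf("DDR4-3200, 2 channels: %.1f GB/s\n", peak_gb_per_s(2, 3200));  // 51.2
    std::printf("DDR4-3200, 4 channels: %.1f GB/s\n", peak_gb_per_s(4, 3200));  // 102.4
    std::printf("DDR4-3200, 8 channels: %.1f GB/s\n", peak_gb_per_s(8, 3200));  // 204.8
}
```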
Well a 3995WX uses a lot more power and costs a lot more.
@@ServeTheHomeVideo Yes. Though looking at the 16-core Ryzen vs the 16-core TR Pro, there isn't much difference in today's prices with the 3955WX at list. The mobo is expensive, but if you're after 4x the RAM channels and 5x the CPU lanes, it's cheap for server-grade connectivity. We're a long way from 8-channel on desktops, but the HEDT option is here now.
@@maxhughes5687 Agreed.
I welcome 8 channel on desktop . . . like yesterday.
Wait this ain't Level1Tech.
ARM and AMD seem similar at first, but they have one big difference. AMD is x86 with cache per core; x86 likes to shuffle through quite complex instruction pipelines per core but is not much of a team player. ARM, with its much simpler core, can, if you have enough shared memory (cache for all cores), split tasks into several smaller pipelines each running on one core. So in your example, you could calculate a+b for the whole data set on one core, c+e on a second, c+d on a third, and a fourth core would decide which answer is correct. As stupid as this might seem in this example, it can be a good feature in routing, streaming data, running simulations, and all the iterative AI stuff.
It doesn't matter if it's x86 or ARM; they all work pretty much the same. Performance-oriented ARM cores are not simple by any definition; they have the same optimizations and complexity that x86 has. As for splitting tasks into smaller pipelines, that's the general approach to multithreaded algorithms and it's completely architecture independent. If you have multiple cores, you try to split your task into independent smaller tasks, usually equal to the number of hardware threads, or at least you limit how much is executed at any one time to the number of hardware threads. Even better is to use vector instructions; you can get orders-of-magnitude performance increases just from that. As for a unified or split L3 cache, for good parallel workloads it doesn't matter: each core is crunching its own set of data, and as long as you don't overlap and don't have any false sharing it should work the same.
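A minimal sketch of the "split the work into as many independent chunks as you have hardware threads" pattern described above, in portable C++ with no architecture-specific code (the data set and chunking are made up for illustration):

```cpp
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<long long> data(1'000'000, 1);  // stand-in workload

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;  // fallback when the runtime cannot report it

    std::vector<long long> partial(n, 0);
    std::vector<std::thread> workers;

    // Each thread sums its own contiguous slice: no shared writes on the hot
    // data, and each core streams through its own chunk of memory.
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = data.size() * t / n;
            const std::size_t end   = data.size() * (t + 1) / n;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();

    const long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::printf("sum = %lld\n", total);
}
```

A vectorized inner loop (compiler auto-vectorization or intrinsics) layers on top of this the same way on either architecture.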
Maybe the software will be licensed per MB of cache and not per core, just like they moved from sockets to cores.
1Tb of HBM3.0!! or a CXL2.x plug
What happens if c==0? ;)
Thumbnail on MB-era looks like Hollywood's Mexican movie filter.
I now understand why Apple's M1 CPU is so fast. They have all of the available RAM in the CPU package, and furthermore, the SSD is also much faster.
Cachedisk when?
These EPYC CPUs have an order of magnitude more cache than my first PC's RAM.
I think it will be significantly longer than 2 years for 64 cores to become insufficient.
The age of RAMDISK is going to be great.
GIGABYTES OF CACHE WHAAAAATTT
From mega bruh to giga bruh?
We use Mega-bros or Mega-broskis :)
lego better than limes.
I need to thank my neighbors for the help with these Legos. I had the idea on this one and went next-door and they helped me get all the parts.
Gigabyte? Pun intended 🤣
Apple already ships processors with 16 GB of onboard cache
great teeth
sorry, "geniuses" messed all things up
system on chip
complete expandable system
money trade is stupid, coin for coin
chuck 48GBytes on chip
Bad video title IMO, should include 'cache'
Technically it may not just be SRAM cache but we discuss HBM too. Stacked DRAM in-package is possible but there was not a recent disclosure on that.
x86 is dying a slow slow death.
Hear hear