5:13 A CPU whose L2 cache can act as a 4-way _and_ a 12-way set-associative cache? Sign me up!
HA!! Well spied :)
Great work! I would feel like I was condescending to people, explaining things this way, but it's actually just nice. Do you know what age group your audience is?
It seems like a 3rd cache video about concurrent writes and thread co-ordination would follow naturally? ;)
Useful to anyone writing parallel code and perhaps the biggest gotcha of cache lines?
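In case it helps anyone before that video exists: the big gotcha here is presumably false sharing. A minimal hedged sketch in C++ - the struct names and the 64-byte line size are my assumptions, not anything from the video:

```cpp
#include <atomic>
#include <thread>

// Both counters share one cache line, so every increment from one
// thread invalidates that line in the other core's cache (false sharing).
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Forcing each counter onto its own (assumed 64-byte) cache line
// removes the ping-ponging without changing the program's meaning.
struct PaddedLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

int main() {
    PaddedLines counters;  // swap in SharedLine to see the slowdown
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) counters.a++; });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) counters.b++; });
    t1.join();
    t2.join();
}
```

On most multi-core x86 parts the padded version runs dramatically faster, because each counter's cache line stops bouncing between cores.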
Yay, you're back! How's the quarantine going for you?
Having lots of fun making vids! Cheers for stopping by mate :)
That was a great and insightful mini-series, mate!
Any comment on the cache organization in the new Ryzen 9 7950X3D with the "3D" cache?
Brilliant animations and explanation.
The L1 cache actually gets weirder - spec sheets often report the L1 _data_ cache alone, and in particular the 3990X (like most other Zen 2 chips) also has a 32kB instruction cache, for a total of 64kB of L1 per core. Across 64 cores, that's _4MB_ of L1 cache.
It has as much L1 as the Phenom has L3! Granted, it also cost 20x as much at launch...
Excellent video, Mate! I know you're a talented musician too. Would be so cool, since you've upped the video editing stakes recently, to include some of it (did I hear singing at the end there?)
It would be great fun to record some music! Thanks for the suggestion mate, and thanks for watching :)
10:05 "Also really curious what Intel comes up with to defend themselves against this _risin'_ threat." I see what you did there ;)
Thank you for the video.
As far as I know, some cache modes and behaviours are programmable in ARM CPUs (embedded models).
Yes, some are! GPU caches are certainly programmable. Just architecture dependent I guess. I wish they were programmable in x64, that would be great fun :)
What are Threadrippers even used for? They're absolutely amazing, but such overkill and so expensive! Gaming? Big servers?
Definitely not gaming. Their CCX/CCD layout introduces too much latency to be top tier at gaming. Its best use case is cloud compute (think VPS) with shared core access. That would keep all the CPUs busy, and the constant context switching between container processes is a very valid reason to need ~300MB of total cache.
Yes, they're certainly more than I generally need - I play Street Fighter and DooM, so an i3 does just fine :) I'd love to code these Ryzens! Just to see one eat some AVX. Or race the Ryzen cores against Intel's AVX-512. The Ryzens would make fine server chips, great for virtual machines and that kind of thing.
Really? I'd have thought they would eat games for breakfast! That's great info, cheers for sharing :)
@WhatsACreel I've just bought a cheap 24-core EPYC 7401 for the same reason. After your reply I understand it doesn't fit latency-sensitive applications well, but it does fit throughput.
The high core count Threadrippers are often used in workstations for scientific work and rendering. As mentioned elsewhere, the latencies tend to get pretty high - that's just the nature of making such a large CPU, and it happens on Intel's 18 and 28 core parts too - so generally you want to give them workloads that are well spread out and have few dependencies between the threads.
@Creel how did you find the cache access latencies for all these chips? thanks.
the rise in price at launch is scary
Is there any way as a programmer to control (directly or indirectly) which data is present at which point in each of the cache levels? Or is that only regulated by the OS or the hardware itself?
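Mostly it's the hardware's job, but x86 does give a couple of indirect levers: prefetch hints to pull data in (mentioned further down), and non-temporal stores to keep data *out* of the caches. A hedged sketch of the latter, assuming SSE2 and 16-byte-aligned buffers (the function name is just for illustration):

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence
#include <cstddef>

// Copy 'count' 16-byte blocks while hinting the CPU to write around
// the caches. Handy for big one-shot copies you won't read back soon,
// so they don't evict data you actually want to stay cached.
void stream_copy(void* dst, const void* src, std::size_t count) {
    auto* d = static_cast<__m128i*>(dst);              // must be 16-byte aligned
    const auto* s = static_cast<const __m128i*>(src);  // must be 16-byte aligned
    for (std::size_t i = 0; i < count; ++i) {
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    }
    _mm_sfence();  // order the streamed stores before later writes
}
```

Both levers are hints to the hardware, not guarantees - there's no architectural way on x64 to pin a region into a given cache level.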
If those L3 caches get much bigger, we may as well just remove RAM entirely. Those are some *huge* caches.
Either that, or rename RAM to "L4 cache".
Still using an i7 2600 in 2021, and its performance is still pretty decent.
Good comparison! Maybe server CPUs should be here too, to see whether it's only the memory or other specs that are better.
Great point, server CPUs tend to have more cores and RAM but lower frequencies. Cheers for the suggestion, and cheers for watching :)
Can you please answer this question? I know the answer but can't work out how to get to it. A processor has 30-bit virtual addresses and 64 MB of physical memory organised in 256 kB pages, plus a 128-byte, 4-way set-associative TLB with a 4-byte block size. How big is the page table (ignoring the status bits), and how many bits are in the TLB tag and index?
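Not the video's author, but here's a hedged walkthrough using the usual textbook approach (worth checking against your course's conventions):
- 256 kB pages = 2^18 bytes, so the page offset is 18 bits, leaving a 30 - 18 = 12-bit virtual page number, i.e. 2^12 = 4096 page table entries.
- 64 MB of physical memory = 2^26 bytes, so a frame number is 26 - 18 = 8 bits; ignoring status bits, each entry is 1 byte, and the page table is 4096 x 1 byte = 4 kB.
- TLB: 128 bytes / 4-byte blocks = 32 entries; 4-way associative means 32 / 4 = 8 sets, so the index is 3 bits and the tag is the remaining 12 - 3 = 9 bits of the virtual page number.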
Ah, the amazing AMD Bulldozer. The CPU specs made it look like it should be doing great, but in reality the chip sat there doing nothing 50% of the time, because it was waiting for memory accesses and the instruction decoder to give it something to do.
What about "trace cache"? What was that all about?
The architecture I'm working on doesn't have cores, and it takes one cycle to access the entire data storage, which is closer than L1 - there's one physical thread per byte of data storage.
Waiting for 1GB of L3.
at this rate we can ditch DRAM entirely
History suggests the opposite: the memory hierarchy will get deeper, not shallower.
When they have 256 megs of cache, why is there no instruction to lock some memory regions into the L3 cache completely?
I mean, I usually run Arch Linux with dwm and a browser with 4 tabs, and I'm below 256MB of RAM usage. Okay, that's probably not what a "regular user" does in Windows 10 with Chrome and endless RAM thrashing, but I was thinking: if you run a game fullscreen, shouldn't its physics engine, or at least the GPU / kernel drivers, be able to mark themselves so they never fall out of L3 at all? It seems like this approach would also make L3 loads faster for the marked areas, since you wouldn't need as much associativity there.
My question is this: why not give more control to programmers, at least at the driver-writer level? I tinker with FPGAs sometimes, and even on that low-gate-count stuff I keep coming up with the idea of making "caches" act like texture memory on GPUs: let the coder load them up and control them more directly. Then no fancy associativity is needed - or it can be kept just for the regions that aren't specified / locked. Or maybe it would be a good idea for a smaller L4 cache to work like this, letting programmers feed stuff into it and refer to it via some kind of "handle"?
I guess the closest thing to a "keep in cache" marker would be the prefetch* instructions / the _mm_prefetch() intrinsic. You'll spend a few cycles looping through all the cache lines you need, but it makes it a lot more predictable that your frequently/likely accessed data will remain at the top of the list. It would be cool to have some addressable on-die static RAM, though. It would increase the cost of context switching significantly, so the OS would have to play well with it, but if your process can have a core to itself for a while, it would be a great tool.
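For anyone wanting to try it, a minimal sketch of that warming loop - the function name, line size, and hint level are my assumptions:

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // typical x86 line size

// Touch every cache line of 'data' with a prefetch hint.
// _MM_HINT_T0 asks for residency in all cache levels; it's only a
// hint - the CPU is still free to evict the lines later.
void warm_cache(const char* data, std::size_t bytes) {
    for (std::size_t off = 0; off < bytes; off += kCacheLine) {
        _mm_prefetch(data + off, _MM_HINT_T0);
    }
}
```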
AMD settled the lawsuit; it never made it to court.
They should have fought it on principle, but it was just cheaper to settle. The other issue is that tech lawsuits are generally complete nonsense. The court system typically rules very poorly on anything scientific or technological. Explaining tech to a judge and/or jury is like explaining this argument to your elderly aunt, grandmother, or Uncle Bob and expecting them to make a decision on it...
The FX was competitive with the Phenom II on a core-to-core basis. If it were actually a quad core, then you'd compare the FX 8000 series to a Phenom II x4, and it would destroy the Phenom II.
The chip technically has an FPU per core anyway; it's a split unit that can run individually or combined. The real problem was that no one optimized for the arch. In apps that fit in the L3, it was faster than an Ivy Bridge Xeon with HT (I've got an FX and a Xeon here).
They were dodging bankruptcy and would have lost anyway. It was a blatant marketing scam.
No one optimized for it because it was an awful idea.
Part 1: ua-cam.com/video/UCK-0fCchmY/v-deo.html
AMD should have made a Phenom III, and figured out something actually good for real life for the Bulldozer in the meantime.
Thanks for reading the spec sheet to me.