The Anatomy of a Modern CPU Cache Hierarchy

  • Published Dec 14, 2024

COMMENTS • 32

  • @TheUpriseConvention 4 days ago +10

    Thank you so much for your videos! I’m currently a machine learning engineer trying to cover the computer science theory I never learnt at school. These videos are a goldmine!

  • @KrisRyanStallard 4 days ago +2

    Excellent video. Informative without getting bogged down in too many unnecessary details

  • @harshnj 3 days ago +4

    You've earned a subscriber.
    Just don't stop making these quality videos

  • @szymonozog7862 4 days ago +2

    Love the series so far, keep it going!

  • @mateuszpragnacy8327 4 days ago +6

    Really good videos. They're really helping me design a cache for my Minecraft CPU ❤

  • @stefanopilone957 4 days ago +3

    Thanks, very clear. I liked & subscribed

  • @dj.yacine 4 days ago +3

    Thanks 👍. High quality 💯

  • @abunapha 3 days ago

    Thank you

  • @chetan_naik 4 days ago +4

    Informative video, but why stop at L3 or L4 cache? Why not add L5, L6, L7, and so on to improve performance?

    • @turner7777 4 days ago +11

      Probably diminishing returns with increased cost and complexity

    • @der.Schtefan 4 days ago +8

      It's down to the way memory is implemented. By the time you get past L3, the latency is almost as "slow" as main system RAM on a memory bus. L1 is usually very expensive, space-eating SRAM.

    • @jedijackattack3594 3 days ago +2

      This has been done before. Intel did an L4 cache on Broadwell for certain C-suffix CPU chips, using a big external eDRAM die.
      The first problem is that cache is rather expensive. Die cost scales roughly quadratically with die size, so a 100 mm^2 die is about 4x as expensive as a 50 mm^2 one. And modern CPU cores are actually quite small: a Zen 5 core is only around 4 mm^2, yet on the full Zen 5 CCD about half the die area is the 32 MiB of L3 cache and the 8 MiB of L2. As an additional problem, thanks to the high clock speeds, the latency grows as the cache gets bigger, just from moving the data from the cache back to the processor.
      Cache is also quite power hungry even when idle, so designers tend to minimise it on consumer platforms, especially devices that idle a lot like phones; Intel went as far as allowing the whole cache to be powered off on the performance cores.
      As for why we still don't see L4 or L5 caches, there are a lot of so-called cache-unfriendly workloads. These workloads tend to have a few things in common: low levels of exploitable instruction-level parallelism, high levels of random branches (especially consecutive branches), and a large, randomly and sparsely accessed dataset. That combination leaves the processor unable to speculate ahead or reorder instructions to hide latency, so it is purely at the mercy of the memory subsystem for how long each stall lasts. Thanks to the random sparse accesses, the caches are unlikely to contain the right data, since lines are constantly thrashed and discarded and the randomness defeats the prefetchers. A bigger cache may let you brute-force the problem, as AMD has proven with the X3D line of chips, but if the dataset is still so much bigger than the cache that the hit rate does not improve, you will see no improvement in performance. And if your new big cache doesn't have enough bandwidth to feed the improved hit rate, the cores end up stalling while they wait for the cache to get around to servicing their requests. Making a cache higher bandwidth makes it bigger and more expensive.
      This is part of the reason so much modern CPU optimisation and HPC software optimisation focuses on making the data and instruction streams as predictable and cache friendly as possible.
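
A minimal C sketch of the cache-unfriendly behaviour described in this comment, for readers who want to see it directly. The array size, shuffle, and timing approach are illustrative assumptions, not something from the video: a sequential scan lets the hardware prefetcher stream lines in, while a dependent random pointer chase over a working set larger than L3 defeats both prefetching and caching.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16u * 1024 * 1024)   /* 16M elements: well past a typical L3 */

int main(void) {
    long long sum = 0;
    int *a = malloc(N * sizeof *a);
    size_t *next = malloc(N * sizeof *next);
    if (!a || !next) return 1;

    /* Sequential scan: addresses are predictable, so the hardware
       prefetcher hides almost all of the memory latency. */
    for (size_t i = 0; i < N; i++) a[i] = (int)i;
    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++) sum += a[i];
    double seq = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Random pointer chase: build a random permutation, then follow it.
       Every load depends on the previous one and hits a random line, so
       neither reordering nor prefetching can hide the miss latency. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];
    double rnd = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("sequential:   %.2fs (sum=%lld)\n", seq, sum);
    printf("random chase: %.2fs (end=%zu)\n", rnd, p);
    free(a);
    free(next);
    return 0;
}
```

On a typical desktop the chase runs an order of magnitude slower despite executing the same number of loads, which is exactly the "at the mercy of the memory subsystem" stall described above.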

    • @chetan_naik 3 days ago

      @@jedijackattack3594 Well explained. I also wonder: when a cache miss occurs, is the latency just the RAM latency, or the RAM latency plus the latencies of all the cache levels combined?

    • @jyotiradityasatpathy3546 3 days ago

      @@chetan_naik It depends on the access architecture. Lookups are usually serial rather than parallel, which means the latencies are added. However, a main-memory access takes far, far longer than a register-file access.
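
To put rough numbers on the "latencies are added" point: the standard back-of-the-envelope model is average memory access time (AMAT), where each level contributes its hit time and a miss adds the cost of the level below. The cycle counts and miss rates here are illustrative assumptions in the ballpark of a modern desktop core, not measured values.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative, assumed numbers (cycles and miss rates), not specs. */
    double l1 = 4, l2 = 14, l3 = 50, dram = 200;
    double m1 = 0.05, m2 = 0.30, m3 = 0.40;

    /* AMAT(level) = hit_time + miss_rate * AMAT(level below).
       A full miss on a serial lookup really does pay L1 + L2 + L3 + DRAM,
       but the DRAM access dominates the sum. */
    double amat3 = l3 + m3 * dram;     /* 130 cycles  */
    double amat2 = l2 + m2 * amat3;    /* 53 cycles   */
    double amat1 = l1 + m1 * amat2;    /* ~6.7 cycles */

    printf("full-miss latency: %.0f cycles\n", l1 + l2 + l3 + dram);
    printf("average (AMAT):    %.1f cycles\n", amat1);
    return 0;
}
```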

  • @der.Schtefan 4 days ago +7

    You did not explain associativity.
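
For anyone curious, the gist of associativity: in a set-associative cache the address is split into tag, set index, and block offset, and a line may live in any of the N ways of its set. Below is a minimal sketch of the index math; the geometry (32 KiB, 8-way, 64-byte lines, roughly a common L1d) is an illustrative assumption.

```c
#include <inttypes.h>
#include <stdio.h>

/* Assumed geometry: 32 KiB, 8-way, 64-byte lines.
   32768 / (8 * 64) = 64 sets -> 6 index bits after 6 offset bits. */
#define OFFSET_BITS 6
#define SET_BITS    6
#define NUM_SETS    (1u << SET_BITS)

int main(void) {
    uint64_t addr = 0x7ffdc0de1234;   /* arbitrary example address */

    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    /* On a lookup the cache reads all 8 tags of 'set' and compares them
       to 'tag' in parallel; the line may reside in any of those 8 ways. */
    printf("addr=%#" PRIx64 " -> tag=%#" PRIx64 " set=%" PRIu64
           " offset=%" PRIu64 "\n", addr, tag, set, offset);
    return 0;
}
```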

  • @stachowi 2 days ago

    Very good (and to the point).

  • @abskrnjn 2 days ago

    How did you make this video? Cool visuals

  • @anonymoususerinterface 2 days ago

    Can I ask where you get this knowledge from? I would like to know more!

    • @BitLemonSoftware 2 days ago

      My own knowledge as a software/firmware engineer, plus the research I do for each video. You can see the sources I used in the description.

  • @hatsuneadc 16 hours ago

    What happens if a line is not yet available in L3 when another core tries to access it? Does it wait for it to propagate, or does it take the last known (old) state?

    • @BitLemonSoftware 8 hours ago

      I didn't fully understand the question. In any case, a cache will never pass stale values to the processor core.
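
For context on the exchange above: multi-core CPUs keep their private caches coherent with a MESI-style protocol, which is why a stale read cannot happen. Here is a simplified sketch of the state machine for the scenario the question describes; this is the textbook model, not the exact protocol of any particular CPU.

```c
#include <stdio.h>

/* Textbook MESI states for one cache line in one core's private cache. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

/* Another core has requested to read a line we hold. If our copy is
   MODIFIED, we must supply the fresh data (and/or write it back) before
   the requester's load can complete -- the requester stalls; it never
   sees an old value from memory or L3. */
mesi_state on_remote_read(mesi_state mine, int *must_supply_data) {
    *must_supply_data = (mine == MODIFIED);
    switch (mine) {
    case MODIFIED:
    case EXCLUSIVE:
    case SHARED:  return SHARED;   /* both copies end up clean + shared */
    default:      return INVALID;  /* we held no valid copy anyway      */
    }
}

int main(void) {
    int supply = 0;
    mesi_state after = on_remote_read(MODIFIED, &supply);
    printf("state after remote read: %s, supplied fresh data: %s\n",
           after == SHARED ? "SHARED" : "INVALID", supply ? "yes" : "no");
    return 0;
}
```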

  • @mikevirutal79 2 days ago

    Great video. Do you have courses on Udemy?