You really took it to another level! Those new simulations stimulated my brain to the point where I stored all this information in my L1 cache. Thanks for the great video once again!
Hahaha! Cheers for watching :)
What is it with the Reichs and their interest in computing
this video was literally more useful than my entire semester...i'm speechless
then you didn't listen
You are way better at explaining this than my university teachers were! The graphics are a huge help. Thank you, this is helping tons of people!
The Disruptor circular buffer makes use of cache characteristics for speed. Thank you for this great explanation of the process!
Omg. So clearly explained. This is very good teaching material. Good job and thank you very much!
Thank you SirUniverse :)
You did a really good job containing the animations, didactics and enthusiasm for this subject! Thank you!
🤯 such a brilliant explanation. Never knew how caches worked, and not sure when I will ever need to know, but it's fascinating stuff.
I can't describe how grateful I am
thank you prof!
Wow! Gilad Reich wrote all that needs writing about the awesome presentation. Can't wait for part 2. I am hoping you might conclude with an alignment strategy to get the most efficient use of the caches. We are certainly not in Kansas anymore, Toto. Great video.
Really glad you liked it! I actually recorded both vids in a single take, but split it into 2 because the second half seemed like a different video. It's just a commentary on a handful of hardware specs. It would be fun to discuss alignment strategies, and particularly access patterns! Maybe in an upcoming video? Thanks for watching :)
Awesome vid btw, your CUDA tutorials got me thru a semester. Set associative seems like a cuckoo table without hashing.
Great video! I was not expecting to see animations in this, and that was a pleasant surprise! Helped with the explanation a lot too!
Really useful and awesome video. Clear and concise, with good examples!
Awesome video man! You're going to blow up with such fantastic content!
This video is absolutely brilliant.
Also, real CPUs have to synchronize between cores. Say core0 has data from address 0x01230123 in its cache, and core1 stores to an address that is in the same cache block as 0x01230123. Now core0 has invalid/old data in its cache. What happens next depends on the ISA (how relaxed the memory model is, and so on), but if I remember correctly, on x86 the invalid/old cache data needs to be reloaded into core0's cache from main memory when it tries to access it. The C/C++ memory model (more relaxed than x86) also has some opinions about this, and that affects how compilers are allowed to generate code for loads and stores.
They do indeed! Synchronization between cores is a great topic!
@@WhatsACreel Actually, there is a coherence protocol for multi-core CPUs: the MESI protocol. Intel and AMD each have their own variants built on MESI (MESIF and MOESI, respectively).
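For anyone curious, here's a minimal C++ sketch of the invalidation traffic being described (the struct names and loop counts are made up for illustration, and it assumes 64-byte cache lines): two counters packed into one line make the cores fight over that line under MESI, while padding each counter to its own line removes the contention ("false sharing").

```cpp
#include <atomic>
#include <thread>

// Two counters that share one 64-byte cache line: each core's writes
// repeatedly invalidate the other core's copy of the line.
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padding each counter to its own cache line avoids the ping-ponging.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void hammer(T& s) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) s.a++; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) s.b++; });
    t1.join();
    t2.join();
}

int main() {
    Shared shared;   // typically much slower: coherence traffic on every write
    Padded padded;   // each core keeps its own line in the Modified state
    hammer(shared);
    hammer(padded);
}
```

On typical hardware the padded version runs several times faster, because neither core ever has to steal the other's line.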
Great stuff, keep it up!
Cheers mate! Thanks for watching :)
Incredible material. Many thanks.
Another thing that is similar but different across CPU ISAs is their function/virtual routine tables for accessing data from the disk drive... There is a sort of cache structure there as well, except the information can be hashed into a virtual lookup table.
Very nice visualization. Super helpful!
Amazingly done! A very clear explanation, thanks Creel! :D
This was an amazing explanation!
This video is great, although YouTube's low bitrate kind of ruins those nice 3D renders. Perhaps you could render at a higher res? Cheers
love the skeletor thing
Really an amazing and clear explanation, great animations too!
In this example we assume there isn't any virtualization, right? All those addresses would be physical addresses.
Holy visualizations, Batman! This is great!
Awesome graphics and to the point explanation! Thanks!
In the next vid will you teach about dirty bits and how the CPU is notified of a change made to ram by another component i.e. the GPU or the disk
Cheers mate! I actually recorded one long video, but decided to split it into two because the second half was different. It's just a chat about some specs from Intel and AMD CPUs. It would be fun to continue with some more info on caches, dirty bits, exclusive v inclusive, victim caches, etc. And the instruction cache, which is a different beast altogether! Anywho, thanks for watching :)
This was so well made and explained!
brilliant
great explanation!
Glad it was helpful!
That's a really great explanation!
THANK U now i actually understand this 4 my final
Fantastic video. Thank you Creel!
Better compilers could probably eliminate the hardware's automatic caching system and precache code and data in an optimised way. Same for the OS / app runtime dynamic memory manager. Currently it isn't possible to access a cache directly (on x86 at least), but it is possible to precache data. If you could access a cache directly, it would save the memory address translation step the hardware has to perform.
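On the "precache data" point: x86 does expose prefetch hint instructions, available in C++ as the _mm_prefetch intrinsic. A rough sketch (the prefetch distance of 64 elements is a made-up value you'd have to tune for real hardware):

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

// Sum an array while hinting the CPU to pull future lines into cache.
double sum_with_prefetch(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + 64),
                         _MM_HINT_T0);  // hint: bring into all cache levels
        total += data[i];
    }
    return total;
}
```

Note it's only a hint: the CPU is free to ignore it, and for a plain linear scan like this the hardware prefetcher usually beats it anyway.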
Thanks for this.
Welcome, cheers for watching :)
I'd love to be able to use Ternary Content Addressable Memory (TCAM) for everything.
I just wish TCAM wasn't so expensive and power hungry, and that the storage densities were actually half-decent.
When you have L1, L2 and L3 cache, isn't data from L1 pushed into L2 when new data comes in? And if the data in L2 gets old, it is moved to L3? Something like that anyhow. My memory on this is fuzzy. Anyhow, I've seen some great videos on coding your programs to maximize cache hits. The code to do this can often look slower, with more code, but the end result is a huge speed increase. I forget where I saw the video now, but it was REALLY fascinating to see normal code versus code which has been designed to maximize cache hits.
Yes, the caches generally evict to higher levels. It might be fun to make a video on exclusive vs inclusive and victim caches! All that stuff is great :)
Techniques called cache tiling/blocking are great! Keep the data being processed in the L1!!
Cheers for watching mate :)
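A small C++ sketch of the tiling/blocking idea mentioned above, using a matrix transpose (N and BLOCK are illustrative values, not tuned ones):

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;
constexpr std::size_t BLOCK = 32;  // illustrative tile size, not tuned

// Naive transpose: the dst writes stride through memory, so lines get
// evicted before their other elements are ever touched.
void transpose_naive(const float* src, float* dst) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            dst[j * N + i] = src[i * N + j];
}

// Tiled transpose: work on BLOCK x BLOCK sub-squares small enough to sit
// in L1, so every cache line brought in is fully used before eviction.
void transpose_tiled(const float* src, float* dst) {
    for (std::size_t ii = 0; ii < N; ii += BLOCK)
        for (std::size_t jj = 0; jj < N; jj += BLOCK)
            for (std::size_t i = ii; i < ii + BLOCK; ++i)
                for (std::size_t j = jj; j < jj + BLOCK; ++j)
                    dst[j * N + i] = src[i * N + j];
}
```

Same work in both versions; the tiled one just reorders it so the cache stops thrashing, which is exactly why cache-friendly code can look "slower" on the page.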
Beautiful - thank you!
Amazing video !
wow , so good - have you done anything on virtual memory please?
Thank you! Great video!
I don't understand: in the start animation he has 4 sets and 4 ways. Is one yellow block a cache line, or do all the yellow blocks together make up a cache line?
My guy
It helped a lot. Thank you.
what happens to the cache line that gets evicted from L1 ? does it get written into L2 and what is that process look like ?
king
THANK U !!!
How exactly does the comparison with tags work? If a set is full, are all these tags going to be compared in parallel or does it work like a binary search?
Given how hardware is naturally parallel I would assume parallel.
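Parallel is right: the hardware has one comparator per way, so all the tags in a set are checked simultaneously, not searched. A toy C++ model of a 4-way set (the loop stands in for the parallel comparators; it also shows the valid bit someone asks about further down):

```cpp
#include <array>
#include <cstdint>

// Toy model of one set in a 4-way set-associative cache. In silicon the
// four tag comparisons happen at the same time, one comparator per way;
// the loop below just stands in for that parallel hardware.
struct Way {
    bool     valid = false;  // valid bit: does this way hold real data?
    uint64_t tag   = 0;
};

struct Set {
    std::array<Way, 4> ways;

    // Returns the matching way on a hit, or -1 on a miss.
    int lookup(uint64_t tag) const {
        for (int w = 0; w < 4; ++w)
            if (ways[w].valid && ways[w].tag == tag)  // hit = valid AND tag match
                return w;
        return -1;  // miss: fetch the line from the next cache level
    }
};
```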
Is it better for you if I let the advertising run all the way to the end?
Ha! I'm not sure... Nice of you to think of that tho! Thanks for watching :)
The one dislike is from intel 😆
Wow, your icon is animated in the notifications... Is it a gif? How did you do that? Hahaha :)
If my program makes a sequential access from beginning to end of some large array, can CPU predict that it will need data from more than just one cache line and start loading the following ones in advance?
That depends on a few other things... It isn't just the hardware and its opcodes; it also depends on the OS and on your compiler/interpreter and how it converts your source code to assembly, byte code, or opcodes... There are many optimizations your compiler/interpreter will make depending on its command-line options and settings... Then it comes down to the architecture and its hardware design, and which features are available. After that, it depends on your operating system and how it handles calls to the underlying hardware, such as reading and writing to disk, creating threads and semaphores, reading and writing to ports, etc.
@@skilz8098 ...Are you sure you know what you're talking about? Never mind, I found out that modern x86 processors do, in fact, have automatic prefetch mechanisms which can detect linear access patterns.
I think they call it smart prefetch at AMD and hardware prefetch at Intel? They certainly do this with the instruction cache too! Compilers will use software prefetch if they're clever enough! Certainly an interesting topic! Cheers for watching :)
@@captainbodyshot2839 I wasn't trying to be too explicit, because you would have to read the datasheets and the ISA manuals to get all of the details. And the available features and techniques that can be used vary by architecture (CPU), platform (OS), and compiler.
For example, you and I could have the exact same hardware and operating system, except I could be using Visual Studio and you could be using GCC or Clang for C++. They all work very similarly and usually implement 98%+ of the C++ standard, but they may do so in different manners.
Compiler A might use register X with instruction 1 where Compiler B might use register Y with instruction 2 to generate the same algorithm.
@@skilz8098 pre-fetching happens at the hardware level, not the software level. An executable/assembly can't tell a processor where to put data in the caches. Different compilers might result in different assembly which may result in the processor handling memory differently among the caches. However, processors are either using this technique or they're not, regardless of your assembly code. These days most processors do it.
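To make the hardware-prefetch discussion concrete, here's a C++ sketch of the two access patterns (the stride of 1024 ints is an illustrative worst case): the sequential scan is exactly the linear pattern a stride prefetcher detects, while page-sized jumps typically defeat it, since most prefetchers won't follow a pattern across a 4 KB page boundary.

```cpp
#include <cstddef>

// Sequential scan: the +1 pattern is what a hardware stride prefetcher
// detects, so upcoming lines are streamed in ahead of the loop.
long sum_sequential(const int* a, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

// Same total, but visiting elements one page apart (1024 ints = 4 KB).
// Prefetchers generally won't follow accesses across page boundaries,
// so nearly every load here misses the cache.
long sum_strided(const int* a, std::size_t n, std::size_t stride = 1024) {
    long total = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < n; i += stride)
            total += a[i];
    return total;
}
```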
0:56, there. And 3:55 also.
how did the offset read 9?
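In case it helps anyone puzzling over the same moment: the offset is just the address's low bits. A worked C++ example with assumed parameters (64-byte lines, 4 sets; the widths in the video may differ) where the low bits happen to come out to 9:

```cpp
#include <cstdint>
#include <cstdio>

// Split an address into tag / set index / offset. The widths here
// (64-byte lines, 4 sets) are assumptions for illustration.
int main() {
    const uint64_t addr       = 0x1249;  // example address
    const uint64_t line_bytes = 64;      // offset = low 6 bits
    const uint64_t num_sets   = 4;       // set index = next 2 bits

    uint64_t offset = addr % line_bytes;              // 0x1249 & 0x3F = 9
    uint64_t set    = (addr / line_bytes) % num_sets; // = 1
    uint64_t tag    = addr / (line_bytes * num_sets); // = 0x12

    std::printf("offset=%llu set=%llu tag=0x%llx\n",
                (unsigned long long)offset, (unsigned long long)set,
                (unsigned long long)tag);
}
```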
Could you turn on subtitles for this video? Thanks
You missed the policy of a "dirty" cache, where data was written to a cache but wasn't synced with RAM when it's evicted....but other than that, pretty much got it.
Fucking amazing!
What is your accent? English pirate? Great content btw
I'm curious about the 'valid bit'. I was told there must be a valid bit too; could someone tell me what happened to it? haha
The architecture I'm working on gets rid of the cache principles; your entire storage space would be more like a level 0, faster than L1.
Part 2: ua-cam.com/video/tde8lhFdczI/v-deo.html
I know this is going to sound silly, but could you work in something about throwing a shrimp onto a barbie
These are not the cache lines you’re looking for.