Please keep having Casey on even if its more "eat your vegetables" than "JavaScript junk food" content. I learn so much every time I listen to this guy talk
+
The right amount of brussel sprouts bowls to burgers is 5 to 1
@@monsieuralexandergulbu3678 So 5 bowls of brussel sprouts for every 1 burger, got it.
it's*
I agree
the CrowdStrike joke was lit
real
It was shit
@@saltstillwaters7506 crowdstrike shareholder spotted :p
simdeeznuts
You and Casey have such good chemistry, please consider turning these videos to a podcast series!
Someone doesn't know about the Jeff and Casey show.
@@braincruser This show was so good, Casey in full unleashed mode.
@@braincruser Yeah, this will also end up with the hosts to the punches.
Please consider sewerslide.
A 400-part series called "Handmade BFFs"
Execution on the crowdstrike joke was really on point
Casey coming in strong with "I don't even know what all this tech slop is, what tf is a fireship and a lavarel?" That's my boy right there! 😂
SIMDeez nuts
give em the ol swizzle
@@DavidM_603 Oh my god. 😂
@@DavidM_603 the ol shuffle
Maximizing the throughput of deeznuts
Another Casey video count me in! I don't care how many hours that guy talks I'm always learning so much from him.
Simdeeznutz, time to learn bud.
Love how Casey is 2x the size of prime just towering over him as disembodied head. 😆
Please invite Casey again! And give him a whiteboard!
Loved the explanation of how L1 cache works.
Prime's perspective as someone who isn't knowledgeable about this topic helped me better understand Casey's explanation.
Would totally watch a regular show or podcast where casey explains to prime how things work down at the hardware level.
It was anything but boring!
Thanks to you two for doing this one
i know lengthy, deep (that's what she said) explanations might be boring for a lot of people and not great to do on stream, but I want to say I really enjoy those. It takes the edge off of all the abstraction we're submerged in every day and it actually feels like computer science. I can't apply any of what Casey said, but I loved every second. I wouldn't want Prime or Casey to feel weird about doing these just because they might hurt the stream's numbers a bit.
Love Casey! He actually knows what he's talking about. Great resource
26:42 Hahaha the chat message “BEAM = berry easy artificial machine” was very under-appreciated
True
I just wanted to say that I really appreciate those in depth explanations
Very good video! Keep having Casey on stream, it's really interesting and entertaining.
I've been listening to Casey since his Handmade Hero series. It was such a formative experience and glad to see him on the channel. Thank you
2 hour long Casey discussion. Sick.
34:48 A handy conversion to remember is that light travels ~1 foot in 1 nanosecond (in a vacuum). Electricity in silicon is about 20% of that
Exactly, wild! So to get to L1 cache (and back) in 3 cycles at 5GHz, the cache could be at most 3/10 of a foot = 3.6 inches away from the core, and that's the absolute best case at the full speed of light; at ~20% of that in silicon it's well under an inch. The cache has to actually do something, too. In practice, L1 cache is separate for instructions and data, and is physically located right next to the associated pieces of the CPU core.
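A quick back-of-the-envelope sketch of that bound, just to make the arithmetic runnable (the 5 GHz clock, 3-cycle latency, and ~20% figure are only the example numbers from these comments, not claims about any specific CPU):

```c
#include <stdio.h>

int main(void) {
    const double clock_hz     = 5e9;  /* example 5 GHz core clock            */
    const double cycles       = 3.0;  /* assumed L1 load-to-use latency      */
    const double c_ft_per_ns  = 1.0;  /* light: ~1 foot per nanosecond       */
    const double silicon_frac = 0.2;  /* ~20% of c for signals in silicon    */

    double window_ns  = cycles / clock_hz * 1e9;        /* 0.6 ns round trip */
    double one_way_ft = window_ns * c_ft_per_ns / 2.0;  /* 0.3 ft at best    */

    printf("round-trip budget: %.2f ns\n", window_ns);
    printf("max one-way distance at c:       %.1f inches\n", one_way_ft * 12.0);
    printf("max one-way distance in silicon: %.2f inches\n",
           one_way_ft * 12.0 * silicon_frac);
    return 0;
}
```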
I care. It's important stuff. Casey Muratori is a fantastic brain that is so enthusiastic. Love that guy.
this was great Prime, i know this type of content is not the best for viewership... but it's deeply appreciated by some of us who want to learn from people like Casey. He is a national treasure.
Casey is great. Dude is so chill
Learning how SIMDeez nuts code is generated from regular C/C++ code by the compiler would be great. Nobody wants to rewrite everything in SIMD, but just having the compiler do that for them, with maybe some minor tweaks and mental-model shifts, would be great
I wanna say mojo is working on something like this if I recall
The compiler is quite limited in what it can vectorize, no? You need to write your program in a vector friendly manner to even hope the compiler will auto vectorize it.
@@TapetBart depends on the semantics of the language, and tons of other things, but yes, the compiler can't do everything. Especially compilers that were not designed for vectorization from scratch
Efficient SIMD code is more about data layout and memory access patterns than particular instructions. The compiler typically can't do anything about your data layout so there are serious limits to what auto vectorization can achieve.
@@Bestmann3n it goes both ways: without SIMD instructions you can't really take full advantage of a good memory layout, and without a good memory layout you can't get the best out of SIMD.
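To make the data-layout point concrete, here's a rough C sketch (not from the video): the SoA loop is the kind of unit-stride code compilers can often auto-vectorize at -O3 (or -O2, depending on the compiler), while the AoS version tends to defeat them because the values it needs are strided through memory. Whether a given compiler actually vectorizes either one depends on version and flags, so treat this as illustrative only.

```c
#include <stddef.h>

/* Array-of-structs: consecutive x values are 16 bytes apart, so a single
   vector load can't pick up several of them contiguously. */
struct particle { float x, y, z, w; };

void scale_aos(struct particle *p, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;            /* strided access: hard to vectorize well */
}

/* Struct-of-arrays: all x values sit next to each other, which is the
   layout both auto-vectorizers and hand-written SIMD want. */
struct particles { float *x, *y, *z, *w; };

void scale_soa(struct particles *p, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        p->x[i] *= k;           /* unit-stride access: typically vectorized */
}
```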
For people who are still confused about the L1/TLB address checking explanation: it's essentially a parallel lookup scheme. Instead of sending the virtual address to the TLB first and then looking up the resulting physical address in the L1 cache, the TLB and L1 are accessed concurrently: the TLB produces the physical address while the L1 picks a set using only untranslated bits, and the access is a hit iff the physical tag stored in that cache line matches the physical address the TLB produced. Doing this is important for speed because the TLB lookup is itself relatively large and slow, since it needs to be in order to support 4KB virtual memory page granularity. The only address bits the L1 can use before translation are the ones within the page offset (because everything above the page offset is what the TLB modifies) but above the cache line offset (because every byte in the cache line is accessed at the same time). This presents a hard limit on how the L1 cache can be structured without breaking or rewriting operating systems; it can only have as many buckets as those bits can refer to.
Bringing it all together: cache lines are 64B == 2^6B, so 6 bits of the address refer to the cache line offset; pages are 4096B == 2^12B, so 12 bits refer to the page offset; 12 - 6 == 6, so 6 bits refer to the offset within a page but not within a cache line; 2^6 == 64, so there can be at most 64 sets (buckets) for the cache; the old L1 cache stored 8 items per bucket, so its total size is 64*8*64B == 32,768B == 32KiB; the new L1 cache stored 12 items per bucket, so its total size is 64*12*64 == 49,152B == 48KiB.
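A small sketch of that bit arithmetic, just to make it runnable (the 64B line, 4KiB page, and 8/12-way figures are the ones from the comment above, not measurements of any particular CPU):

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BITS  6   /* 64B cache line -> bits [5:0] are the line offset    */
#define PAGE_BITS 12   /* 4KiB page      -> bits [11:0] survive translation   */

/* Only the untranslated bits between the line offset and the page offset can
   pick the set, so there can be at most 2^(12-6) = 64 sets. */
static unsigned l1_set_index(uint64_t addr) {
    return (unsigned)((addr >> LINE_BITS) & ((1u << (PAGE_BITS - LINE_BITS)) - 1));
}

int main(void) {
    unsigned sets = 1u << (PAGE_BITS - LINE_BITS);        /* 64 */
    printf("max sets: %u\n", sets);
    printf("8-way  L1 size: %u bytes\n", sets * 8  * 64);  /* 32768 = 32KiB */
    printf("12-way L1 size: %u bytes\n", sets * 12 * 64);  /* 49152 = 48KiB */
    printf("set index for 0x12345678: %u\n", l1_set_index(0x12345678u));
    return 0;
}
```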
Please make this a monthly or biweekly podcast. Love y'all's interactions, you really bring the best out of eachother.
Always great to see Casey on the show -- love these interviews!
Amazing video. As someone who is studying compilers for (hopefully) a career switch one day, I would really love to watch that SIMD talk.
To make this comment more productive, I'd like to add that the majority of cache misses in VMs like the JVM happen because of the additional tag field added to each object, not because objects are scattered all over memory.
The JVM, for example, uses a class of GC algorithms known as mark-compact collectors. During the compact phase the GC will place objects that reference each other as close together as possible. This is something that even a C++ programmer has to actively think about and doesn't get for "free".
Before the collection happens, objects are also allocated in something called a TLAB, a Thread-Local Allocation Buffer. These buffers are large memory spaces exclusive to one thread, so objects allocated by that thread can always be placed next to each other without any interference from the outside world.
If anyone is more interested in this stuff I suggest a lesser-known book:
The Garbage Collection Handbook: The Art of Automatic Memory Management
This book is basically the CLRS of memory management algorithms.
I'd be interested to know how the tag hurts cache performance. Is this just because that extra memory dilutes the cache, or is there some level of indirection going on?
Wow that is really interesting, I didn't know that. Gonna check up on that book
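On the question of how the tag hurts: the usual argument is plain dilution, since every object drags its header bytes into the cache alongside its payload. A rough C-flavored sketch of that effect (the 16-byte header and 8-byte payload are illustrative numbers only; actual JVM header sizes depend on the build and flags):

```c
#include <stdio.h>

int main(void) {
    const unsigned line    = 64;  /* cache line size in bytes                  */
    const unsigned payload = 8;   /* useful data per object (illustrative)     */
    const unsigned header  = 16;  /* per-object header/tag (illustrative only) */

    unsigned bare   = line / payload;             /* payloads per line, no headers   */
    unsigned tagged = line / (payload + header);  /* payloads per line with headers  */

    printf("payload-only objects per cache line: %u\n", bare);    /* 8 */
    printf("objects per cache line with headers: %u\n", tagged);  /* 2 */
    return 0;
}
```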
SIMD nuts all the way. The opmask regs that came with AVX-512 are the true GOAT of that extension. New opmask instructions were added for operating on all the vector sizes: 128, 256, and 512-bit.
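For anyone who hasn't used them, here's a small hedged sketch of what the opmask registers buy you: per-lane predication without a separate compare-and-blend. It assumes a compiler and CPU with AVX-512F plus AVX-512VL for the 256-bit forms; the intrinsic names are the standard Intel ones.

```c
#include <immintrin.h>

/* Adds b to a only in the lanes where a[i] > 0, leaving the other lanes
   untouched, using an opmask register instead of compare + blend.
   Requires AVX-512F + AVX-512VL for the 256-bit variants. */
__m256 add_where_positive(__m256 a, __m256 b) {
    __mmask8 m = _mm256_cmp_ps_mask(a, _mm256_setzero_ps(), _CMP_GT_OQ);
    return _mm256_mask_add_ps(a, m, a, b);  /* masked-off lanes pass 'a' through */
}
```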
This is great, I just started the Coursera course "Nand to Tetris" so I can actually understand how a computer works. Then boom, the same week this gem shows up
That's a great course and very fun too
But can you do NaN to Tetris?
@@jwr6796 I can't even spell NaN....
His website is gold for this stuff. Almost too much information but its all good
@@jwr6796 If I remember right, all float ops involving NaN spit out NaN, so I don't think it would work... Now if you could build a logic table where you can get more than one result... (well, there are 2 types of NaNs, signaling vs non-signaling, and there are probably some bits left...)
I really enjoy in-depth talks like these with Casey. Please keep it going, it's really incredible.
I just want to say this was fascinating. we need more of this.
Love Casey and these deep dives. He's incredibly interesting to listen to!
This may actually be my favourite discussion so far, I thought I already understood a lot of this but I was missing some key concepts.
Really good show, watched the whole thing and would love to see another one. Great vibes, learned a lot, what more can you ask for. Keep it up, guys!
You should post this as a podcast, so I can listen to it while walking my dogs in the forest.
YES
please do
I would LOVE to listen to this while walking my dogs in the forest.
IF I HAD ANY
This is actually one of primary reasons why I bought UA-cam Premium. No ads and offline background videos. Most of UA-cam content that I consume is basically podcasts. I primarily listen to videos and having them downloaded is great when you need to drive/be somewhere without good internet access
what prevents you from...i don't know...playing the youtube video and listening to it just like a podcast?
@@OBEYTHEPYRAMID I’m assuming there’s no internet service on his dog walk in the forest
Casey is by far my favorite guest! I learn a ton every time he’s speaking. Also he’s great at simplifying and explaining things!
I think we should go even deeper with Casey in the future.
when I started programming I watched around 100 episodes of Handmade Hero.
I think a lot of people don't have that context.
I know enough of the basics of virtual memory and cache associativity to follow this, but I think a lot of people, even experienced ones, don't have this context
Best content in my feed for weeks. You're both great!
Casey is amazing. Please bring him back!
- How do you know Casey is dropping tech bars?
- His mouth is open
btw, simdeeznuts
Man, we need more Casey on the channel. Love hearing his expertise. He is a great teacher. I found the CPU deep dive chat very fascinating. Would love to hear more things like it.
Him not knowing fireship is funny 🤣
he even said "idk what fireship is" instead of "who" lol
That point Casey made about it feeling positive was spot on. Always feel really excited about my job after listening to these. The point of building things for the joy of building things really hit home as well. I've been struggling to figure out why I don't enjoy programming any longer, and it is literally because "get it out now!".
SIMDeez NUTs
This video is so good, I am listening to it twice! Casey is such a good communicator. He could have just told you, "Intel can't increase cache because 4096 is a small number," but instead he took us through a constructive and instructive journey of the entire system so we could reach that conclusion with him. Before he mentioned the memory size limit, I had already intuitively gotten there, built on the scaffolding he had put up in my mind. Brav-f'n-O! This is my Brav-f'n-O face.
Thank you Prime! Casey is awesome! This is just such an interesting subject, now searching for the HW engineer's perspective as Casey mentioned @1:05 :D
SIMDeezNUTZ
I love this kind of stuff. Now I can watch the same video all week! (which is exactly what I'm going to have to do if I want any chance at understanding what these guys are talking about)
Edit: I might be exposing myself as a noob, but if you hear all this, doesn't it make you respect the devices we use every day that much more?
Not bored at all dude. Stayed till the end.... 👍
CASEY IS ON THE CASE!!!!
Oh definitely keep these coming, these are a goldmine.
SIMDeezNuts
SIMDEEZNUTS
49:13 missed opportunity to make a cache hit joke right there
Not so many popular channels go this deep, explained so well. Prime content right here.
simdeeznuts!
You guys should do a semi-regular segment, call it "Prime Lesson Time w/ Uncle Casey"
I want to call him uncle because of a friend of my dad's who I called uncle who was like Casey; very smart tech-wise but with that strong Dad energy and the ability to explain things as simply as possible.
Alt: "Prime Lesson Time w/ Mr Muratori" if you want to be fancy.
Every time I see a video that has Casey in it makes me smile.
These things just go above my head, there is so much more to learn
Do you know what Beam is? If so, please enlighten me.
This is such great entertainment. I already knew most of this but
1) I feel so smart
2) This is not efficiently put together but entertainingly put together
I have nothing but love for this. SIMD would be great, I would probably orgasm if you'd discuss long word instruction pipelining, so don't do that.
Simply put: This was awesome 🎉
This was really interesting, bring Casey on more
1:04:00 or so was a lightbulb for me and I suddenly understood it once he tied it to a cache miss. I can't believe an hour already passed watching this, it just flew right by
Amazing video, I heard people say AMD made improvements but I didn't understand the terms. Finally someone is talking about what the improvements mean, thank you
Please do more with this guy, it reminds me of the time when we had to know our hardware well if we wanted to write code for it
I was here for a little of this conversation and it was great.
Simdeeznuts for more Casey interviews
SIMDEEZ NUTS
Bro I love Casey sooo much!! Please bring him on more
Casey seems very knowledgeable, love to hear his thoughts
simdeez nuts
Damn you and your working of the algorithm. Also, SIMDeezNuts.
the crowdstrike joke.... beautiful!
Casey just single handedly elaborated the best JavaScript defense argument EVER
As a normie with no programming/coding anything, I actually understand this. Cheat Engine vaguely works based off of "bits that don't change" and "bits changing less often" gaming experience ftw
The web industry is getting laughed at - but we deserve it.
In the discussion of modulo vs. masking for the hashing: masking with a power of two minus one is the same as modulo by that power of two, for unsigned integers anyway. i & 255 == i % 256 where i is an unsigned integer.
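A tiny sketch just to make that identity concrete (unsigned arithmetic assumed; the 256 and 4096 constants are only examples):

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    for (uint32_t i = 0; i < 100000; i++) {
        assert((i & 255u) == (i % 256u));          /* mask = 2^8 - 1, modulo 2^8 */
        assert((i & (4096u - 1)) == (i % 4096u));  /* same trick for a 4KiB page */
    }
    return 0;
}
```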
wow this is taking me back to the days of cpu designs :) physical & virtual addressing. page aligns, cache flushing. oh the memories.
Props to Mr eagen for following through with esoteric questions that were apparently "spot on". That's not an easy feat to follow Casey's beautiful in depth explanation.
more Casey please! Amazing knowledge and content.
Incredible stuff here, will cross-reference in 5 years when I finally understand everything Casey said 😅 that translation buffer was kind of a crazy concept.
I could listen to these deep-dives for ages
Bring Casey more. He is such a delight
Lots of interesting thoughts on the vertical potential of LLMs. IMO they are and continue to be used as blunt instruments: the techniques are brand new, and we're still learning incredible amounts about how to use and combine the components. I think regardless of the hypothetical vertical potential in the future, there are going to be huge amounts of lateral expansion as every industry and niche finds its own special use cases and refined designs.
Casey is awesome, his course opened my mind to new things after 24 years of professional (yea right...) programming.
Well today I truly feel like a nerd. Sadly I understand exactly what Casey is explaining.
This was a great conversations!
Learned a lot by listening in 😊
Where can I find more stuff like this, on this level!?
I loved this, the whole L1 cache thing was super interesting
Casey is basically explaining the Hennessy Patterson book. Although he's good at doing so :)
😂😂😂 all I needed was the crowd strike joke.
For anyone wanting to understand exactly what Casey is referring to when he talks about the associative caching.
[Virtual Memory: 13 TLBs and Caches] ua-cam.com/video/3sX5obQCHNA/v-deo.html
So many of these absolute gems of channels buried all over the place, thank you for sharing
I have to rewatch the "32 KiB 8-way -> 48 KiB 12-way" explanation again, I need to take notes and draw some diagrams to understand this.
CPUs are so fascinating dude!
This is Prime content honestly
It's very easy: A and B are isomorphic if you can define a bijection between A and B.
Not sure if you were trying to be funny, but I think 'bijection' is a _less_ known word than isomorphic
@@Muskar2 What do you mean less known, it’s just injection and surjection happening at the same time.
I like your funny words, magic man.
Not sure if you were trying to be funny, but I think 'injection' and ‘surjection’ are less known words than bijection
Casey is so knowledgeable on this stuff but -- and I don't mean this in a bad way -- speaks in such a dense fashion that I had to rewind at several points to re-listen to what he said, just to follow what he was saying in that first hour. I think it'd be virtually impossible to follow what he's saying live, since I had to go through what he said at my own pace. It's all good, but it's akin to a scientific journal article that has to be read over and over again to grasp what is being said, rather than something focused on giving a wide perspective or a 'top down' view of the situation. I think his brain just works like that. He's built to walk you through something, not to summarize what something is.
He's better when giving a prepared speech - and it helps greatly if he knows his audience well
@@Muskar2Indeed. :)
Interesting interview. Great deep dive about 8 ways etc. Didn't know that. At all.
My background for the last 10 years is datacenter infrastructure at a movie studio, and CPUs are a very big topic in that space, and really in any space where you have scale-out compute (scientific computing, rendering, etc). Outside of those environments, the CPU really doesn't matter to the consumers of the CPU unless it causes some sort of bug.
3:47 That CrowdStrike joke has to go viral.
I can listen to Casey talking metal all day.
did not see the CrowdStrike joke coming at all. 😂
This giant head is disturbing
yeah i agree
his brain is too big to fit in normal screen size
true but its funny too :D
The ego must fit somewhere
the giant talking head of wisdom and knowledge