Ongoing Notes:
1. I should note that which words pay attention to which others doesn't always line up with our human expectations. In this video I frequently claim that "meal" should attend to "savory" and "delicious", but if you look at the attention weights matrix at 9:25, "meal" attends most to "savory" and not so much to "delicious". In reality, the model is going to do whatever it needs to do to excel at next-word prediction, which might mean setting the attention weights differently than our human brains would "neatly expect". Still, the illustration of "meal" attending to "savory" and "delicious" is usually correct; I just wanted to clarify that it's not guaranteed, and that's not a bad thing.
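If you want to poke at this yourself, here is a minimal NumPy sketch of how one row of an attention-weights matrix comes about. The tokens and vectors below are toy, randomly initialized stand-ins, not anything from the video's actual model.

```python
import numpy as np

tokens = ["the", "savory", "delicious", "meal"]
d = 4                                   # toy embedding size
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), d))   # query vectors, one per token (random stand-ins)
K = rng.normal(size=(len(tokens), d))   # key vectors, one per token

scores = Q @ K.T / np.sqrt(d)           # how well each query matches each key
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

# The row for "meal": how strongly it attends to every token, including itself.
print(dict(zip(tokens, weights[tokens.index("meal")].round(2))))
```

With trained projections instead of random vectors, this is the kind of row the matrix at 9:25 is showing.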
Best video I have ever seen for explaining the attention mechanism; attention is finally clear to me ❤
This channel is gold!!!
crystal clear explanation man, immediately subscribed!
Perfect timing, learning about this in class right now!
You got this!
Question: LLMs obviously 1) account for hierarchies of concepts/abstractions and 2) perform complicated, decision-tree-like logical operations on those concepts (and words). Having read about attention and watched a dozen videos on it, I have never encountered an explanation of how attention can do these things. My guess is that the stacking of attention layers is instrumental here, but I have seen no discussion or explanation of this.
I'm not sure if I would say LLMs "obviously" do those two things, but they are certainly emergent behaviors due to increases in compute. Scaling laws are pretty cool!
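On the stacking point, here is a rough, hand-wavy sketch of my own (not code from the video) of why layering attention lets information combine into higher-level relations: each pass mixes token representations that the previous pass already mixed, so after a couple of layers a token can be influenced by tokens it never attended to directly.

```python
import numpy as np

def self_attention(X):
    """One toy attention layer (no learned projections), just to show the repeated mixing."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X                      # each token becomes a blend of all tokens

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                 # 5 made-up tokens, 8-dim embeddings
for layer in range(3):                      # stack three layers
    X = self_attention(X)                   # layer n mixes what layer n-1 already mixed
```

Real models interleave these layers with learned projections and MLP blocks, which is where the more interesting "logic" is believed to emerge.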
Liked, subscribed, and commented. This is pure gold!
Thanks a ton!
great job. I've been studying the subject by myself and had missed the visualization of vector sums in the value space. thanks for posting.
Glad it was helpful!
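For anyone else who found that picture helpful: the "vector sums in the value space" boil down to one line of math. Below is a tiny made-up example (not numbers from the video): the output for a token is just the weighted average of everyone's value vectors, using that token's attention weights.

```python
import numpy as np

values = np.array([[1.0, 0.0],    # value vector for "savory"   (made-up 2-D values)
                   [0.0, 1.0],    # value vector for "delicious"
                   [0.5, 0.5]])   # value vector for "meal"
weights_for_meal = np.array([0.6, 0.3, 0.1])   # attention weights from "meal" (made-up)

output_for_meal = weights_for_meal @ values    # weighted sum in value space
print(output_for_meal)                         # -> [0.65 0.35]
```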
Fantastic explanation! For the next videos in this series, please touch upon the role of the residual connection. I'm still iffy on what it's doing.
Great suggestion!
Oof, I really needed this a while ago, finally!
Sorry to be late but I hope it was worth it!
Great explanation, loved it!
Glad you liked it!
yessssss. let's talk about those in the next videos. this is a great channel because of the way you explain things. I don't know if it's too far ahead but it would be awesome to see some small code examples too.
Working on it!
thank you for these videos !!!
Of course!
I waited for it for months...
sorry for the wait! hope it is worth it 😎
Can you do a video on how inputs (e.g. words, videos, audio) are tokenized into vectors?
This is great. Thanks!
No problem!
The attention values for a word in the attention matrix are along the rows? What do the columns represent? I always imagined this matrix to be like a covariance matrix, but by the looks of it I might be quite wrong about that.
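In the standard formulation (which I believe the video follows), each row is a query word and each column is a key word: entry (i, j) says how much word i attends to word j. Every row sums to 1, and the matrix is generally not symmetric, which is one way it differs from a covariance matrix. A tiny sketch with made-up numbers:

```python
import numpy as np

tokens = ["a", "savory", "meal"]
# Made-up weights: row i = how much token i (the query) attends to each token j (the key).
A = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.6, 0.3]])

print(A.sum(axis=1))        # every row sums to 1 (softmax over the keys)
print(np.allclose(A, A.T))  # False: unlike a covariance matrix, it's not symmetric
```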
❤thanks
Of course!
Amazing video! Would be nice to see how it's actually calculated on a short sentence of just a few words.
Great suggestion!