With that, these are the three videos I had planned out. Do check out the previous ones if you missed them!
What kind of videos would you guys like to see next?
Hey, I think vcubingx should explain sparse attention: it makes models handle large inputs more efficiently by only attending to a subset of elements. For long sequences this gives a computational advantage (it requires less computation than full softmax attention).
I'd recommend reading this: 'research.google/blog/rethinking-attention-with-performers/?m=1'
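To make that concrete, here is a rough NumPy sketch of one common flavor of sparse attention (a fixed local window); the function name `local_attention` and the `window` parameter are just illustrative, and for clarity it still builds the full score matrix before masking, whereas a real sparse implementation would only compute the allowed blocks.

```python
# Rough sketch of local (windowed) sparse attention, using only NumPy.
# Each query attends to at most `window` neighbors on either side instead of
# all n positions, so only O(n * window) scores actually matter.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(Q, K, V, window=4):
    """Sparse attention: position i only attends to positions within
    `window` steps of i. Q, K, V have shape (n, d)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # dense here for clarity only
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores[mask] = -np.inf                           # block everything outside the window
    weights = softmax(scores, axis=-1)               # rows renormalize over allowed positions
    return weights @ V

# Tiny usage example on random toy data.
rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.normal(size=(3, n, d))
out = local_attention(Q, K, V, window=4)
print(out.shape)  # (16, 8)
```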
This is one of the best explanations of attention I have seen so far. Understanding the bottleneck motivation really makes this clear right around 3:15.
you’re doing god’s work brother, thank you for the series
Thanks!
great explanation
I really like how easy you make it to understand the why of things. I think you've accomplished your goal of making it seem like I could come up with this!
Please cover multi headed self attention next! :)
I am worried that this simple approach skips important pieces of the puzzle though. Transformers do have a lot of moving parts it seems. But it seems like you're only getting started!
Thanks for this series :)
Thank you, Vivek. Absolutely love your content. Please also keep adding Math content, though. Maybe create a playlist about different functions, limits etc? Whatever suits you.
Just wow! Subscribed.
Thank you for your work! Your videos were very helpful for understanding the evolution of transformers 👍
Best explanation I have found so far, thank you
What a great video mfv I paid attention the whole time
Nice, time to boost this video in the algorithm by typing out a comment
Truly amazing explanation, thx!
it was really good. thank you
What's the name of the piano tune that appears at the beginning of the video?
Great material and presentation, thanks a lot for your work! I'd like to see a deep dive into how embeddings work. We can get embeddings from decoder-only models like GPTs, Llamas, etc., and they use some form of embeddings for their internal representations, right? But there are also encoder-only models like BERT and others (OpenAI's text-embedding models) which are actually used instead. What is the difference, and why does one work better than the other? Is it just because of compute differences, or are there some inherent differences?
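Since the question above is about how embeddings get pulled out of these models in practice, here is a minimal sketch assuming the Hugging Face `transformers` library; the checkpoint name is a placeholder, and mean pooling over the last hidden state is just one common convention, not the only way to get a sentence embedding.

```python
# Minimal sketch: sentence embedding via mean pooling of the last hidden state.
# Assumes the Hugging Face `transformers` library; the model name is a placeholder
# (an encoder like BERT here, but decoder-only checkpoints expose hidden states too).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden state over non-padding tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)                  # last_hidden_state: (1, seq_len, hidden)
    hidden = outputs.last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vec = embed("Attention lets each token look at every other token.")
print(vec.shape)  # torch.Size([1, 768]) for bert-base-uncased
```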
awesome
What about the meaning of the Q, K and V matrices?
nice vid
tyfs
epic
❤❤❤
If you hate others, you're really just hating yourself, because we are all one with god source
Weird, 3b1b has the same series going on now.
He works for 3b1b
Every video on planet earth explains attention with "translation", when every individual on planet earth uses ChatGPT "NOT IN TRANSLATION". We use it to CHAT ... Why use translation to explain? It is so weird ...
Thanks!