woah grant getting the gains
high dimensional vascularity
I know. Now he is as hot as he is smart!
-"Swol is the goal, size is the prize!" - 3B1B Loss Function, probably
@@poke0003 Ah, I see you are a man of culture as well. Glad to see other Robert Frank connoisseurs :)
3Curls1Extension Grant Sanderson
That was easily the best explanation I have ever seen. Way to decrypt some of the most magical-seeming mechanisms of the transformer architecture. Thanks a lot for this!
Someone's been working out!
AI generated😂
Grant should team up with Andrej Karpathy. They'd make the best Deep Learning education platform
They already do make the best deep learning education platform
@@nbme-answers Yeah but separately
Two of the most talented educators on yt. Their two series on neural nets are basically anything a curious person needs to start building their own models. Grant gives you the big picture with immense sensibility and insane visualization. Andrej gives you all the technical details in reasoning, implementation and advanced optimization, with an empathy for your ignorance comparable to Feynman's haha.
@@nbme-answers what is it?
Another in a long, long line of excellent educational presentations. If you didn't exist, we'd have to invent you, which would be quite hard. So I'm glad you already exist.
38:30 The only reason we use tokenization is due to limited computational resources, *but* not for meaning. We gain efficiency improvements of about ~400% when using BPE for the same budget (1 token ≈ 4 characters).
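As a rough back-of-the-envelope check on that ~4x figure, here is a tiny sketch (the 8192-token context window and the 4-characters-per-token average are assumed, illustrative numbers, not values from the talk):

```python
# Compare how much raw text a fixed token budget covers with character-level
# tokens versus BPE tokens, assuming ~4 characters per BPE token on average.
context_window_tokens = 8192            # assumed budget, purely illustrative
avg_chars_per_bpe_token = 4

chars_char_level = context_window_tokens * 1
chars_bpe = context_window_tokens * avg_chars_per_bpe_token

print(f"character tokens cover ~{chars_char_level} characters")
print(f"BPE tokens cover       ~{chars_bpe} characters")
print(f"roughly {chars_bpe // chars_char_level}x more text for the same budget")
```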
I finally see the human behind the great videos I watch!
Such a great way to learn and understand the intuition behind this work. I sometimes think about the people who started these sorts of works, and all the groups of people who thought about the possibility of encoding language and expressing it mathematically. It turns out that even once you understand these concepts, it is still an outstanding effort, and the ideas behind it are superb.
Crazy to think that some people thought about this, had the ambition, and actually expected to build a tool. Once you understand it and it is well explained, yes, it might not look impossible, but you can still see how groundbreaking it was.
Thanks, Grant, for taking the time to share this.
I guess the main question here is "Is Grant Natty?"
he should be steve in minecraft movie
real
Can't be worse than Jack Black
I wouldn't mind "giving a talk" type videos like this from Grant every now and then. I think I would actually prefer this style over the regular one.
Grant is in great shape.
Just came here from the LLMs for beginners video. Loved the talk, very informative. Keep the great work up, man 👏🏼
me too 🙌😊
30 to 50% of the neurons in the brain's cortex are devoted to vision, compared to 8% for touch and just 3% for hearing.
That means learning how to see, look, and process visual information is at the center of human intelligence.
That question at the 54-minute mark about analog computing making LLMs more efficient - yes. There are a LOT of smart people, experts in the field, who are working on exactly that. Maybe a next direction for your continued learning?
Great addition to your pre-existing series!
My left ear thanks you
Thank you❤️
Great talk! Bad questions.
As someone who is in the "wishes he took math more seriously" camp, I wish we were given more, ANY, cool examples of what was possible with applied math. Growing up in rural Ohio, the only things math was pushed for were business/finance and maybe some CS stuff; however, it was always abstract: here are some concepts, learn them for the test. Think how many cool things can be done inside 3D programs such as Blender with just an above-average understanding of geometry.
I acknowledge my failings in this too, as I did not seek these things out while I was in school. I also might have some age-related FOMO lol, since the things I enjoy doing now, VFX/Blender/CGI, are all based on concepts I am having to teach myself or re-learn on my own, as a man who is almost 40.
Thank you for this; it is going to take a couple of watches for it to sink in haha.
I agree. Kids would put a lot more effort into learning math if they were shown how incredibly useful it is in real life. Being really good at math is like having a superpower compared to people who are not.
Good job 😃
Please collaborate with Andrej Karpathy and make a huge deep learning platform, or at least explain stuff in this format regularly. We don't need animations every time; PPT or chalk-and-talk is also fine, sir!
3 blew one blown
I really hope this has something to do with the video
That’s clever
Nice try Diddy
Oh my god. This is incredible. You're a genius!
This is truly one of the most clever things I have seen a long time
For a word like ‘bank’, which can have different meanings in different contexts, does the LLM store it as a single vector, or can it store multiple vectors for each known variation of the word?
It’s initially embedded as one vector, but one big point of the attention layer is to allow for context-based updates to that vector
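To make that reply concrete, here is a toy sketch of the idea: one static embedding for "bank", then a single softmax-weighted attention-style update pulls it toward whichever context it appears in. The 4-dimensional vectors are made up for illustration and are not real model weights:

```python
import numpy as np

# Made-up 4-dimensional embeddings; a real model uses thousands of dimensions.
emb = {
    "bank":  np.array([0.9, 0.1, 0.0, 0.0]),
    "river": np.array([0.0, 0.0, 1.0, 0.0]),
    "money": np.array([0.0, 0.0, 0.0, 1.0]),
}

def contextualize(target, context_words):
    """Nudge the target's static embedding toward its context (attention-like update)."""
    vectors = np.stack([emb[w] for w in context_words])
    scores = vectors @ emb[target]                    # dot-product relevance scores
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the context
    return emb[target] + weights @ vectors            # residual-style update

print(contextualize("bank", ["river", "bank"]))  # drifts toward the "river" sense
print(contextualize("bank", ["money", "bank"]))  # drifts toward the "money" sense
```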
Which word/token sits in the middle, at (0, 0, 0, 0, 0, ...), for example for ChatGPT-4?
@39:00, Why not make tokens full words?
(time to read up on byte-pair encoding!)
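For anyone following that pointer, here is a minimal, purely illustrative sketch of the byte-pair-encoding idea: start from characters and repeatedly merge the most frequent adjacent pair. Real tokenizers (such as the ones used by GPT models) work on bytes and are considerably more elaborate:

```python
from collections import Counter

def learn_bpe_merges(text, num_merges):
    """Toy BPE: begin with single characters, then fuse the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):                        # rewrite the stream with the merge applied
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = learn_bpe_merges("low lower lowest", num_merges=5)
print(merges)  # e.g. ('l', 'o') then ('lo', 'w') get merged first
print(tokens)  # a mix of subword chunks and leftover single characters
```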
Your voice seems very familiar. It took me 10 seconds to realize you are the 3b1b.
Grant! We now know what LLMs are, but what about LMMs - Learning Mealy Machines (named so by me)?
A learning Mealy machine is a finite automaton in which the training data stream is remembered by constructing disjunctive normal forms of the automaton's output function and of the transition function between its states. Those functions are then optimized (compressed, with losses, by logic transformations like De Morgan's laws, arithmetic rules, instruction loop rolling/unrolling, etc.) into generalized forms. That introduces random hypotheses into the automaton's functions, so it can be used for inference. The optimizer for the automaton's functions may be another AI agent (even a neural net), or any heuristic algorithm you like.
Machine instructions would be used to calculate the automaton's output function and transition function. At first, as the automaton tries some action and receives a reaction, the corresponding terms of those functions are constructed in plain "mov"s and "cmp"s with "jmp"s (assume the x86 ISA here). Then the machine instructions of all action-reaction pairs are optimized by arithmetic rules, loop rolling and unrolling, etc., so the size of the program is reduced. That optimization may also include some hypotheses about "don't care" values of the functions, which will be corrected in future passes if they turn out to be wrong...
Imagine that code running on something like Thomas Sohmers' Neo processor, or Sunway SW26010, or Graphcore Colossus MK2 GC200.
One kind of transformation people often seem to forget is loop rolling (not just unrolling), i.e. making an instruction loop (a "for x in range a..b" statement) out of a long repetitive sequence of instructions.
...Kudos for Bodybuilding!
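For readers who haven't met the underlying formalism, here is a minimal sketch of an ordinary (non-learning) Mealy machine, just to ground the terminology: the output depends on both the current state and the current input. The states, symbols, and tables are made up, and the learning/compression scheme proposed above is not implemented here:

```python
# Transition and output tables of a tiny, hand-written Mealy machine.
transition = {("idle", "go"): "busy", ("busy", "go"): "busy",
              ("busy", "stop"): "idle", ("idle", "stop"): "idle"}
output = {("idle", "go"): "start", ("busy", "go"): "keep_going",
          ("busy", "stop"): "halt", ("idle", "stop"): "noop"}

def run(inputs, state="idle"):
    """Feed an input stream through the machine, collecting the output stream."""
    outputs = []
    for symbol in inputs:
        outputs.append(output[(state, symbol)])   # output depends on (state, input)
        state = transition[(state, symbol)]       # then the state advances
    return outputs, state

print(run(["go", "go", "stop"]))  # (['start', 'keep_going', 'halt'], 'idle')
```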
great
So good
let's go
Choosing the next word, by any name, is thinking.
Agreed. Except here we are not talking about "choosing". We are talking about "calculating the probability that a specific word belongs there". And this is (mainly) math.
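A tiny sketch of that "calculating the probability" step: the model emits one score (logit) per vocabulary word, and softmax turns the scores into a probability distribution over candidate next words. The vocabulary and logits below are invented for illustration:

```python
import numpy as np

vocab = ["mat", "moon", "banana", "roof"]
logits = np.array([3.1, 1.2, -0.5, 0.8])   # hypothetical scores for "The cat sat on the ..."

probs = np.exp(logits - logits.max())       # subtract the max for numerical stability
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda wp: -wp[1]):
    print(f"{word:>7}: {p:.2%}")
```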
It would be nice to see more about the KAN approach, which is very promising.
nooo you were in Munich and didn't tell us :((((
amazing
Another roof video!? Oh…
I graduated with a degree in electrical engineering back in '07. I did not understand most of anything that was talked about in this video.
im really proud of being alive at the same time as you
Great Speech
Video came out like 10 minutes ago and it is 50 mins long
@@raideno56 That's what's so great about it, very big
@@raideno56 watched it on 5x speed lol
@@jordantylerflores did you have subway surfers on the side as well?
Grant, is it not basically DOE in statistics? Kind regards, Jason
you're smiling like you're microdosing LSD or something
hi
Second!
First! (of commenting after watching the whole thing)
Why would he explain cartoon to them?
Great stuff,
Yet a generalized go/no-go theory or reference in space doesn't undoubtedly build an assimilated seed of deterministic responsibility for our mixed multitude to simulate strong identifiers and compute the modern world; that would be a sir on the opposite side of the equivalence principle to Einstein lol
Great thinker in renormalization, overly extended, and everyone is ready for the overly delayed era of optimization. We got nuked and detoured from this quest, but it's great to be back on par with goals of multiple generations that were so rudely interrupted by the world
❤🫡
Why do people make that mouth-smacking sound whenever they start a sentence?