woah grant getting the gains
high dimensional vascularity
I know. Now he is as hot as he is smart!
-"Swol is the goal, size is the prize!" - 3B1B Loss Function, probably
@@poke0003 Ah, I see you are a man of culture as well. Glad to see other Robert Frank connoisseurs :)
3Curls1Extension Grant Sanderson
That was easily the best explanation I have ever seen. Way to decrypt some of the most magical-seeming mechanisms of the transformer architecture. Thanks a lot for this!
Someone's been working out!
AI generated😂
Grant should team up with Andrej Karpathy. They'd make the best Deep Learning education platform
They already do make the best deep learning education platform
@@nbme-answers Yeah but separately
Two of the most talented educators on yt. Their two series on neural nets are basically anything a curious person needs to start building their own models. Grant gives you the big picture with immense sensibility and insane visualization. Andrej gives you all the technical details in reasoning, implementation and advanced optimization, with an empathy for your ignorance comparable to Feynman's haha.
@@nbme-answers what is it?
Another in a long, long line of excellent educational presentations. If you didn't exist, we'd have to invent you, which would be quite hard. So I'm glad you already exist.
38:30 The only reason we use tokenization is due to limited computational resources, *but* not for meaning. We gain efficiency improvements of about ~400% when using BPE for the same budget (1 token ≈ 4 characters).
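As a rough back-of-the-envelope check on that ~4x figure, here is a tiny sketch (the 8192-token context window and the 4-characters-per-token average are assumed, illustrative numbers, not values from the talk):

```python
# Compare how much raw text a fixed token budget covers with character-level
# tokens versus BPE tokens, assuming ~4 characters per BPE token on average.
context_window_tokens = 8192            # assumed budget, purely illustrative
avg_chars_per_bpe_token = 4

chars_char_level = context_window_tokens * 1
chars_bpe = context_window_tokens * avg_chars_per_bpe_token

print(f"character tokens cover ~{chars_char_level} characters")
print(f"BPE tokens cover       ~{chars_bpe} characters")
print(f"roughly {chars_bpe // chars_char_level}x more text for the same budget")
```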
I finally see the human behind the great videos I watch!
Such a great way to learn and understand the intuition behind this work. I sometimes think about the people who started these sorts of works, and all the groups of people who thought about the possibility of encoding language and expressing it mathematically. It turns out that even once you understand these concepts, it is still an outstanding effort, and the ideas behind it are superb.
Crazy to think that some people thought about this, had the ambition, and actually expected to build a tool. Once you understand it and it is well explained, yes, it might not look impossible, but you can still see how groundbreaking it was.
Thanks, Grant, for taking the time to share this.
I guess the main question here is "Is Grant Natty?"
he should be steve in minecraft movie
real
Can't be worse than Jack Black
I wouldn't mind "giving a talk" type videos like this from Grant every now and then. I think I would actually prefer this style over the regular one.
Grant is in great shape.
Just came here from the LLMs for beginners video. Loved the talk, very informative. Keep the great work up, man 👏🏼
me too 🙌😊
30 to 50% of the neurons in the brain's cortex are devoted to vision, compared to 8% for touch and just 3% for hearing.
That means learning how to see, look, and process visual information is at the center of human intelligence.
That question at the 54-minute mark about analog computing making LLMs more efficient - yes. There are a LOT of smart people, experts in the field, who are working on exactly that. Maybe a next direction for your continued learning?
Great addition to your pre-existing series!
My left ear thanks you
Thank you❤️
Great talk! Bad questions.
As someone who is in the "wishes he took math more seriously" camp, I wish we were given more, ANY, cool examples of what was possible with applied math. Growing up in rural Ohio, the only things math was pushed for were business/finance and maybe some CS stuff; however, it was always abstract: here are some concepts, learn them for the test. Think how many cool things can be done inside 3D programs such as Blender with just an above-average understanding of geometry.
I acknowledge my failings in this too, as I did not seek these things out while I was in school. I also might have some age-related FOMO lol, since the things I enjoy doing now, VFX/Blender/CGI, are all based on concepts I am having to teach myself or re-learn on my own, as a man who is almost 40.
Thank you for this; it is going to take a couple of watches for it to sink in haha.
I agree. Kids would put a lot more effort into learning math if they were shown how incredibly useful it is in real life. Being really good at math is like having a superpower compared to people who are not.
Good job 😃
Please collaborate with Andrej Karpathy and make a huge deep learning platform, or at least explain stuff in this format regularly. We don't need animations every time; PPT or chalk-and-talk is also fine, sir!
3 blew one blown
I really hope this has something to do with the video
That’s clever
Nice try Diddy
Oh my god. This is incredible. You're a genius!
This is truly one of the most clever things I have seen a long time
For a word like ‘bank’, which can have different meanings in different contexts, does the LLM store it as a single vector, or can it store multiple vectors for each known variation of the word?
It’s initially embedded as one vector, but one big point of the attention layer is to allow for context-based updates to that vector
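To make that reply concrete, here is a toy sketch of the idea: one static embedding for "bank", then a single softmax-weighted attention-style update pulls it toward whichever context it appears in. The 4-dimensional vectors are made up for illustration and are not real model weights:

```python
import numpy as np

# Made-up 4-dimensional embeddings; a real model uses thousands of dimensions.
emb = {
    "bank":  np.array([0.9, 0.1, 0.0, 0.0]),
    "river": np.array([0.0, 0.0, 1.0, 0.0]),
    "money": np.array([0.0, 0.0, 0.0, 1.0]),
}

def contextualize(target, context_words):
    """Nudge the target's static embedding toward its context (attention-like update)."""
    vectors = np.stack([emb[w] for w in context_words])
    scores = vectors @ emb[target]                    # dot-product relevance scores
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the context
    return emb[target] + weights @ vectors            # residual-style update

print(contextualize("bank", ["river", "bank"]))  # drifts toward the "river" sense
print(contextualize("bank", ["money", "bank"]))  # drifts toward the "money" sense
```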
Which word/token sits in the middle, at (0, 0, 0, 0, 0, ...), for example for ChatGPT-4?
@39:00, Why not make tokens full words?
(time to read up on byte-pair encoding!)
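For anyone following that pointer, here is a minimal, purely illustrative sketch of the byte-pair-encoding idea: start from characters and repeatedly merge the most frequent adjacent pair. Real tokenizers (such as the ones used by GPT models) work on bytes and are considerably more elaborate:

```python
from collections import Counter

def learn_bpe_merges(text, num_merges):
    """Toy BPE: begin with single characters, then fuse the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):                        # rewrite the stream with the merge applied
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = learn_bpe_merges("low lower lowest", num_merges=5)
print(merges)  # e.g. ('l', 'o') then ('lo', 'w') get merged first
print(tokens)  # a mix of subword chunks and leftover single characters
```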
Your voice seems very familiar. It took me 10 seconds to realize you are the 3b1b.
Grant! We now know what LLMs are, but what about LMMs - Learning Mealy Machines (named so by me)?
A learning Mealy machine is a finite automaton in which the training data stream is remembered by constructing disjunctive normal forms of the automaton's output function and of the transition function between its states. Those functions are then optimized (compressed, with losses, by logic transformations like De Morgan's laws, arithmetic rules, instruction loop rolling/unrolling, etc.) into generalized forms. That introduces random hypotheses into the automaton's functions, so it can be used for inference. The optimizer for the automaton's functions may be another AI agent (even a neural net), or any heuristic algorithm you like.
Machine instructions would be used to calculate the automaton's output function and transition function. At first, as the automaton tries some action and receives a reaction, the corresponding terms of those functions are constructed in plain "mov"s and "cmp"s with "jmp"s (assume the x86 ISA here). Then the machine instructions of all action-reaction pairs are optimized by arithmetic rules, loop rolling and unrolling, etc., so the size of the program is reduced. That optimization may also include some hypotheses about "don't care" values of the functions, which will be corrected in future passes if they turn out to be wrong...
Imagine that code running on something like Thomas Sohmers' Neo processor, or Sunway SW26010, or Graphcore Colossus MK2 GC200.
One kind of transformation people often seem to forget is loop rolling (not just unrolling), i.e. making an instruction loop (a "for x in range a..b" statement) out of a long repetitive sequence of instructions.
...Kudos for Bodybuilding!
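For readers who haven't met the underlying formalism, here is a minimal sketch of an ordinary (non-learning) Mealy machine, just to ground the terminology: the output depends on both the current state and the current input. The states, symbols, and tables are made up, and the learning/compression scheme proposed above is not implemented here:

```python
# Transition and output tables of a tiny, hand-written Mealy machine.
transition = {("idle", "go"): "busy", ("busy", "go"): "busy",
              ("busy", "stop"): "idle", ("idle", "stop"): "idle"}
output = {("idle", "go"): "start", ("busy", "go"): "keep_going",
          ("busy", "stop"): "halt", ("idle", "stop"): "noop"}

def run(inputs, state="idle"):
    """Feed an input stream through the machine, collecting the output stream."""
    outputs = []
    for symbol in inputs:
        outputs.append(output[(state, symbol)])   # output depends on (state, input)
        state = transition[(state, symbol)]       # then the state advances
    return outputs, state

print(run(["go", "go", "stop"]))  # (['start', 'keep_going', 'halt'], 'idle')
```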
great
So good
let's go
Choosing the next word, by any name, is thinking.
Agreed. Except here we are not talking about "choosing". We are talking about "calculating the probability that a specific word belongs there". And this is (mainly) math.
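A tiny sketch of that "calculating the probability" step: the model emits one score (logit) per vocabulary word, and softmax turns the scores into a probability distribution over candidate next words. The vocabulary and logits below are invented for illustration:

```python
import numpy as np

vocab = ["mat", "moon", "banana", "roof"]
logits = np.array([3.1, 1.2, -0.5, 0.8])   # hypothetical scores for "The cat sat on the ..."

probs = np.exp(logits - logits.max())       # subtract the max for numerical stability
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda wp: -wp[1]):
    print(f"{word:>7}: {p:.2%}")
```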
It would be nice to see more about the KAN approach, which is very promising.
nooo you were in Munich and didn't tell us :((((
amazing
Another roof video!? Oh…
I graduated with a degree in electrical engineering back in '07. I did not understand most of anything that was talked about in this video.
im really proud of being alive at the same time as you
Great Speech
Video came out like 10 minutes ago and it is 50 mins long
@@raideno56 That's what's so great about it, very big
@@raideno56 watched it on 5x speed lol
@@jordantylerflores did you have subway surfers on the side as well?
Grant, is it not basically DOE in statistics? Kind regards, Jason
you're smiling like you're microdosing LSD or something
hi
Second!
First! (of commenting after watching the whole thing)
Why would he explain cartoon to them?
Great stuff,
Yet a generalized go/no-go theory or reference in space doesn't undoubtedly build an assimilated seed of deterministic responsibility for our mixed multitude to simulate strong identifiers and compute the modern world; that would be a sir on the opposite side of the equivalence principle to Einstein lol
Great thinker in renormalization, overly extended, and everyone is ready for the overly delayed era of optimization. We got nuked and detoured from this quest, but it's great to be back on par with goals of multiple generations that were so rudely interrupted by the world
❤🫡
Why do people make that mouth-smacking sound whenever they start a sentence?