The matrix math behind transformer neural networks, one step at a time!!!
- Published May 31, 2024
- Transformers, the neural network architecture behind ChatGPT, do a lot of math. However, this math can be done quickly using matrix math because GPUs are optimized for it. Matrix math is also used when we code neural networks, so learning how ChatGPT does it will help you code your own. Thus, in this video, we go through the math one step at a time and explain what each step does so that you can use it on your own with confidence.
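The matrix form of the attention step the video walks through can be sketched in a few lines of NumPy. This is a generic illustration with made-up random numbers, not the exact values used in the video:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: one head, no masking."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every token pair
    return softmax(scores, axis=-1) @ V   # weighted mix of the values

# three tokens, each embedded in 2 dimensions (as in the video's examples)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
Wq, Wk, Wv = (rng.normal(size=(2, 2)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 2): one updated embedding per token
```

Because the whole thing is matrix multiplication, a GPU can compute the scores for all token pairs at once.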
NOTE: This StatQuest assumes that you are already familiar with:
Transformers: • Transformer Neural Net...
The essential matrix algebra for neural networks: • Decoder-Only Transform...
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
YouTube Membership: / @statquest
...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
paypal: www.paypal.me/statquest
venmo: @JoshStarmer
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
0:00 Awesome song and introduction
1:43 Word Embedding
3:37 Position Encoding
4:28 Self Attention
12:09 Residual Connections
13:08 Decoder Word Embedding and Position Encoding
15:33 Masked Self Attention
20:18 Encoder-Decoder Attention
21:31 Fully Connected Layer
22:16 SoftMax
#StatQuest #Transformer #ChatGPT
Josh Starmer is the GOAT. Literally every morning I wake up with some statquest, and it really helps me get ready for my statistics classes for the day. Thank you Josh!
Bam!
definitely the goat🐐
Very educational, and also innovative in the way of doing it. I have never seen such teaching elsewhere. You are the BEST !
Thank you! :)
As an electronics hobbyist/student from way back in the 70s I like to keep up as best I can with technology. I'm really glad I don't have to remember all the details in this series. There are so many layers upon layers that at times I do "just keep going to the end" of the videos. Nevertheless I still manage to learn key aspects and new terms from your excellent teaching abilities. There must be an incredible amount of work involved in creating these lessons.
I will purchase your book because you deserve some form of appreciation and it'll serve as a great reference resource. Much respect Josh and thanks, Kieron.
Thank you very much!
You weren't kidding, it's here! You're a man of your word and a man of the people.
Thanks!
Josh! Thanks for this video, it has been easier for me to see the matrix representation of the computation than the arrow diagrams used previously. I really appreciate your explanation using matrices!
Glad it was helpful!
DUDE JOSH, FINALLY! I have been waiting for this episode for a year or more. I’m so proud of you bro. You got there!
Thanks a ton!
This is really good. The simple example you used was very effective for demonstrating the inner workings of the transformer.
Thank you very much!
Always been a huge fan of the channel, and at this point in my life this video really couldn't have come at a better time. Thanks for helping us viewers with some of the best content on the planet (I said what I said)!
Thanks!
statquest's the best thing i ever found on the internet
Thank you!
Amazing, thank you Josh. You deserve millions more subscribers
Thank you!
Your videos are a didactic stroke of genius! 👍
Glad you like them!
Thanks for introducing the concepts about transformers
My pleasure!
Wow Sqatch! Long time no see my friend! Good to see you.
Your videos are so much fun that one does not feel we are actually in the class. Thank you Josh.
Thanks!
Please Add this video in your Neural Network Playlist. I recently started watching that playlist
Done!
following you from 🇨🇩
bam!
The full Neural Networks playlist, from the basics to deep learning, is here: ua-cam.com/video/CqOfi41LfDw/v-deo.html
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Just ordered your book 😊 Thanks for the love and care you put into this
Doing god's work, Josh!
Thank you!
Amazing video! Can't wait for the next one. By the way, I think there's a small typo at 5:15 where the first query weight in the matrix notation should be 2.22 instead of 0.22
Oops! Thanks for catching that!
Superrrb Awesome Fantastic video
Thanks 🤗!
Josh do you know how to use embedding layers to add context to a regression model?
And do you offer 1-on-1 guidance? I’m stuck on a problem regarding this videos topic
Hmmm...I'm not sure about the first question and, unfortunately, I don't offer one-on-one guidance.
Hey....did you cover the training steps in this video ? Sorry if I missed it
No, just how the math is done when training. We'll cover more details of training in my next video when we learn how to code transformers in PyTorch.
Question: If all tokens can be calculated in parallel, then why is time-to-first-token such an important metric for model performance?
That might be related to decoding, which, during inference, is sequential.
The time to first token may refer to producing the first token from the decoder in the autoregressive setting, where (for example, in sentence translation) the model produces one token at a time, then feeds it back into itself to generate the next one, and so on. This process is sequential, while the computation of all the matrices (over the already existing embeddings) is parallel.
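The sequential generation loop described in this thread can be sketched as follows. The `toy_model` here is a hypothetical stand-in for a real transformer (it just returns fixed logits), but the loop structure is the point: each new token requires another full forward pass:

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=20):
    """Greedy autoregressive decoding: one new token per forward pass."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # processes all current tokens in parallel...
        # ...but new tokens are still produced one at a time
        next_id = max(range(len(logits[-1])), key=logits[-1].__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_model(ids):
    # hypothetical "model": always predicts token 2 (our pretend <EOS>) next
    return [[0.0, 0.0, 1.0]] * len(ids)

print(generate(toy_model, [5, 7], eos_id=2))  # [5, 7, 2]
```

The prompt is processed in one parallel pass (which sets the time-to-first-token), and every generated token after that adds another pass, which is why decoding feels sequential even though the matrix math inside each pass is parallel.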
perfection
Thank you!
TRIPLE BAM!!!!!!!!
:)
Great details. But please, in teaching it's very important to give some imaginable concept as a framework; for me it's hard to connect all these calculations with the goal and see why it works. Start with the concept of an n-sphere (let it be just a 2D circle, since we are using 2 values per token) and explain that multiplying by the matrices Wq and Wk effectively rotates the Q and K vectors on that n-sphere (a linear transformation can do more than rotate, but rotation is the useful intuition here, since we measure alignment afterwards). The dot product of Q and K then measures how co-directional the vectors are, like a cosine similarity in [-1..1] (except divided not by the product of the 2-norms but by the square root of the dimensionality, for computational performance, as you mentioned). And when we multiply by V, we are actually "mixing" the values in each dimension according to that Q-K alignment. Rotations, alignment measurement, mixing. Repeat.
Noted!
need a video on degrees of freedom!!!
Noted!
Will you ever make videos on the subjects of Reinforcement learning, NLP or generative models?
I think you could argue that this video is about NLP and is also a generative model, and I'll keep the other topic in mind.
@@statquest I'll explain myself better, as I admit I phrased it poorly. For deep learning and machine learning you made amazing videos that covered the subjects from basic aspects to advanced ones, essentially teaching the whole subject in a fun, creative & enjoyable sequence of videos that can help beginners learn it from top to bottom.
However, for NLP for example you did talk about specific subjects like word embedding or auto-translation, but there are other topics (mostly older things) in that field that are important to learn such as n-grams & HMM.
So my question was not only about specific advanced topics that connect to others, but rather about a full course that covers the basics of the subject as well.
Sorry for my bad phrasing and thank you both for your quick answer and amazing videos! 😄
@@yuvalalmog6000 I hope to one day cover HMMs.
Usually "vamos" will not be one token but two. How can the algorithm handle this division?
You could split "vamos" into two tokens, "va" and "mos", and then the output from the decoder would be "va", "mos", "<EOS>".
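The splitting itself is done by a subword tokenizer. As a rough sketch (this is a greedy longest-match scheme in the spirit of WordPiece, with a tiny made-up vocabulary, not the tokenizer from the video):

```python
def tokenize(word, vocab):
    """Greedy longest-match subword split (WordPiece-style sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None  # no vocabulary piece matches: unknown word
    return pieces

# hypothetical vocabulary in which "vamos" is not a single token
vocab = {"va", "mos", "v", "a", "m", "o", "s"}
print(tokenize("vamos", vocab))  # ['va', 'mos']
```

From the transformer's point of view nothing changes: "va" and "mos" each get their own embedding and position encoding, and the decoder simply emits them one after the other.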
Goat
:)
10:51 How come each token's maximum similarity isn't with itself?
This example, trained on just 2 phrases ("what is statquest? and "statquest is what") is too simple to really show off the nuance in how these things work.
@@statquest Ah, so with more training and a bigger dataset we can expect the weights to give values closer to what we intuitively expect, like, as I said, each word having the biggest similarity with itself? Great video to see the matrices in action, and I like the content and don't want to be rude, but I think touching on such details a bit would've been nice. Also, maybe something on Multi-Head Attention?
@@I.II..III...IIIII..... I believe that is correct. And I'll talk about multi-head attention more in my video on how to code transformers.
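Until that video is out, multi-head attention can be sketched like this: each head gets its own weight matrices, runs the same attention math independently, and the head outputs are concatenated. The shapes and random numbers here are illustrative, not the video's values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    """Run several attention heads independently and concatenate the results."""
    outs = []
    for Wq, Wk, Wv in heads:  # each head has its own query/key/value weights
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1)  # typically followed by one more linear layer

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 2))  # 3 tokens, 2-d embeddings
heads = [tuple(rng.normal(size=(2, 1)) for _ in range(3)) for _ in range(2)]
print(multi_head_attention(X, heads).shape)  # (3, 2): two 1-d heads concatenated
```

Since the heads share no weights, each one is free to learn a different notion of which tokens should attend to which.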
Why did they use the square root of d_k? Why not just d_k? If anyone knows the answer, please give a good explanation.
To quote from the original manuscript, if the components of q and k are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Thus, dividing the dot products by the square root of d_k results in variance = 1. That said, unfortunately, as you can see in this illustration, the variance for q and k is much higher than 1, so the theory doesn't actually hold.
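That variance argument is easy to check numerically. Under the paper's idealized assumption (components drawn from a standard normal, which real trained q and k generally are not), the raw dot products have variance close to d_k, and dividing by sqrt(d_k) brings it back near 1:

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n = 100_000

# n independent query/key vectors with mean 0, variance 1 per component
q = rng.normal(0, 1, size=(n, d_k))
k = rng.normal(0, 1, size=(n, d_k))

dots = (q * k).sum(axis=1)                 # n independent dot products
print(dots.var())                          # close to d_k = 64
print((dots / np.sqrt(d_k)).var())         # close to 1 after scaling
```

Keeping the pre-softmax scores near unit variance matters because large-magnitude scores push the softmax into regions with tiny gradients.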
Cody finished his story😅
One more to go - in the next video we'll code this thing up in PyTorch.
the A B C thing... i think it is inspired by Sesame Street LoL!!!!!!! 🙂
:)
🎉
:)
Kolmogorov-Arnold Networks videoooooo mr bam
I'll keep that in mind.
You made me love data science. If not for you, I would have learned like a zombie.
bam!
Thanks for the great contents! One minor thing - at 5:24 minute, the first element of the Query weight matrix should be 2.22, but not 0.22
Yep. That's a typo.
It would be nice if you develop courses of Object Detection, mainly YOLO
I'll keep that in mind.
With all due respect, please stop singing at the beginning of your videos. Having that at the beginning of every video is very irritating.
Noted