The matrix math behind transformer neural networks, one step at a time!!!
- Published May 31, 2024
- Transformers, the neural network architecture behind ChatGPT, do a lot of math. However, this math can be done quickly using matrix math because GPUs are optimized for it. Matrix math is also used when we code neural networks, so learning how ChatGPT does it will help you code your own. Thus, in this video, we go through the math one step at a time and explain what each step does so that you can use it on your own with confidence.
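The matrix form of the attention step the video walks through can be sketched in a few lines of NumPy. This is a generic illustration with made-up random numbers, not the exact values used in the video:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: one head, no masking."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every token pair
    return softmax(scores, axis=-1) @ V   # weighted mix of the values

# three tokens, each embedded in 2 dimensions (as in the video's examples)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
Wq, Wk, Wv = (rng.normal(size=(2, 2)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 2): one updated embedding per token
```

Because the whole thing is matrix multiplication, a GPU can compute the scores for all token pairs at once.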
NOTE: This StatQuest assumes that you are already familiar with:
Transformers: • Transformer Neural Net...
The essential matrix algebra for neural networks: • Decoder-Only Transform...
If you'd like to support StatQuest, please consider...
Patreon: / statquest
...or...
YouTube Membership: / @statquest
...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
statquest.org/statquest-store/
...or just donating to StatQuest!
paypal: www.paypal.me/statquest
venmo: @JoshStarmer
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
/ joshuastarmer
0:00 Awesome song and introduction
1:43 Word Embedding
3:37 Position Encoding
4:28 Self Attention
12:09 Residual Connections
13:08 Decoder Word Embedding and Position Encoding
15:33 Masked Self Attention
20:18 Encoder-Decoder Attention
21:31 Fully Connected Layer
22:16 SoftMax
#StatQuest #Transformer #ChatGPT
Josh Starmer is the GOAT. Literally every morning I wake up with some statquest, and it really helps me get ready for my statistics classes for the day. Thank you Josh!
Bam!
definitely the goat🐐
Very educational, and also innovative in the way of doing it. I have never seen such teaching elsewhere. You are the BEST !
Thank you! :)
As an electronics hobbyist/student from way back in the 70s I like to keep up as best I can with technology. I'm really glad I don't have to remember all the details in this series. There are so many layers upon layers that at times I do "just keep going to the end" of the videos. Nevertheless I still manage to learn key aspects and new terms from your excellent teaching abilities. There must be an incredible amount of work involved in creating these lessons.
I will purchase your book because you deserve some form of appreciation and it'll serve as a great reference resource. Much respect Josh and thanks, Kieron.
Thank you very much!
You weren't kidding, it's here! You're a man of your word and a man of the people.
Thanks!
Josh! Thanks for this video, it has been easier for me to see the matrix representation of the computation than the arrow diagrams used previously. I really appreciate your explanation using matrices!
Glad it was helpful!
DUDE JOSH, FINALLY! I have been waiting for this episode for a year or more. I’m so proud of you bro. You got there!
Thanks a ton!
This is really good. The simple example you used was very effective for demonstrating the inner workings of the transformer.
Thank you very much!
Always been a huge fan of the channel, and at this point in my life this video really couldn't have come at a better time. Thanks for helping us viewers with some of the best content on the planet (I said what I said)!
Thanks!
statquest's the best thing i ever found on the internet
Thank you!
Amazing, thank you Josh. You deserve millions more subscribers
Thank you!
Your videos are a didactic stroke of genius! 👍
Glad you like them!
Thanks for introducing the concepts about transformers
My pleasure!
Wow Sqatch! Long time no see my friend! Good to see you.
Your videos are so much fun that one does not feel we are actually in the class. Thank you Josh.
Thanks!
Please Add this video in your Neural Network Playlist. I recently started watching that playlist
Done!
following you from 🇨🇩
bam!
The full Neural Networks playlist, from the basics to deep learning, is here: ua-cam.com/video/CqOfi41LfDw/v-deo.html
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Just ordered your book 😊 Thanks for the love and care you put into this
Doing god's work, Josh!
Thank you!
Amazing video! Can't wait for the next one. By the way, I think there's a small typo at 5:15 where the first query weight in the matrix notation should be 2.22 instead of 0.22
Oops! Thanks for catching that!
Superrrb Awesome Fantastic video
Thanks 🤗!
Josh do you know how to use embedding layers to add context to a regression model?
And do you offer 1-on-1 guidance? I’m stuck on a problem regarding this videos topic
Hmmm...I'm not sure about the first question and, unfortunately, I don't offer one-on-one guidance.
Hey....did you cover the training steps in this video ? Sorry if I missed it
No, just how the math is done when training. We'll cover more details of training in my next video when we learn how to code transformers in PyTorch.
Question: If all tokens can be calculated in parallel, then why is time-to-first-token such an important metric for model performance?
That might be related to decoding, which, during inference, is sequential.
The time to first token may refer to producing the first token from the decoder in the autoregressive setting, where (for example, in sentence translation) the model produces one token at a time, then feeds it back into itself to generate the next one, and so on. This process is sequential, while the computation of all the matrices (over the already existing embeddings) is parallel.
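The sequential generation loop described in this thread can be sketched as follows. The `toy_model` here is a hypothetical stand-in for a real transformer (it just returns fixed logits), but the loop structure is the point: each new token requires another full forward pass:

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=20):
    """Greedy autoregressive decoding: one new token per forward pass."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # processes all current tokens in parallel...
        # ...but new tokens are still produced one at a time
        next_id = max(range(len(logits[-1])), key=logits[-1].__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_model(ids):
    # hypothetical "model": always predicts token 2 (our pretend <EOS>) next
    return [[0.0, 0.0, 1.0]] * len(ids)

print(generate(toy_model, [5, 7], eos_id=2))  # [5, 7, 2]
```

The prompt is processed in one parallel pass (which sets the time-to-first-token), and every generated token after that adds another pass, which is why decoding feels sequential even though the matrix math inside each pass is parallel.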
perfection
Thank you!
TRIPLE BAM!!!!!!!!
:)
Great details. But please, in teaching it's very important to give some imaginable concept as a framework; for me it's hard to connect all these calculations with the goal and see why it works. Start with the concept of an n-sphere (let it be just a 2D circle, since we are using 2 values per token) and explain that multiplying by the matrices Wq and Wk effectively rotates the Q and K vectors on that n-sphere (a linear transformation can do more than rotate, but rotation is the useful intuition here, since we measure alignment afterwards). The dot product of Q and K then measures how co-directional the vectors are, like a cosine similarity in [-1..1] (except divided not by the product of the 2-norms but by the square root of the dimensionality, for computational performance, as you mentioned). And when we multiply by V, we are actually "mixing" the values in each dimension according to that Q-K alignment. Rotations, alignment measurement, mixing. Repeat.
Noted!
need a video on degrees of freedom!!!
Noted!
Will you ever make videos on the subjects of Reinforcement learning, NLP or generative models?
I think you could argue that this video is about NLP and is also a generative model, and I'll keep the other topic in mind.
@@statquest I'll explain myself better, as I admit I phrased it poorly. For deep learning and machine learning you made amazing videos that covered the subjects from basic aspects to advanced ones, essentially teaching the whole subject in a fun, creative & enjoyable sequence of videos that can help beginners learn it from top to bottom.
However, for NLP for example you did talk about specific subjects like word embedding or auto-translation, but there are other topics (mostly older things) in that field that are important to learn such as n-grams & HMM.
So my question was not only about specific advanced topics that connect to others, but rather about a full course that covers the basics of the subject as well.
Sorry for my bad phrasing and thank you both for your quick answer and amazing videos! 😄
@@yuvalalmog6000 I hope to one day cover HMMs.
Usually "vamos" will not be one token but two. How can the algorithm handle this division?
You could split "vamos" into two tokens, "va" and "mos", and then the output from the decoder would be "va", "mos", "<EOS>".
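The splitting itself is done by a subword tokenizer. As a rough sketch (this is a greedy longest-match scheme in the spirit of WordPiece, with a tiny made-up vocabulary, not the tokenizer from the video):

```python
def tokenize(word, vocab):
    """Greedy longest-match subword split (WordPiece-style sketch)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None  # no vocabulary piece matches: unknown word
    return pieces

# hypothetical vocabulary in which "vamos" is not a single token
vocab = {"va", "mos", "v", "a", "m", "o", "s"}
print(tokenize("vamos", vocab))  # ['va', 'mos']
```

From the transformer's point of view nothing changes: "va" and "mos" each get their own embedding and position encoding, and the decoder simply emits them one after the other.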
Goat
:)
10:51 How come each token's maximum similarity isn't with itself?
This example, trained on just 2 phrases ("what is statquest? and "statquest is what") is too simple to really show off the nuance in how these things work.
@@statquest Ah, so with more training and a bigger dataset we can expect the weights to give values closer to what we intuitively expect, like, as I said, each word having the biggest similarity with itself? Great video to see the matrices in action, and I like the content and don't want to be rude, but I think touching on such details a bit would've been nice. Also, maybe something on Multi-Head Attention?
@@I.II..III...IIIII..... I believe that is correct. And I'll talk about multi-head attention more in my video on how to code transformers.
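Until that video is out, multi-head attention can be sketched like this: each head gets its own weight matrices, runs the same attention math independently, and the head outputs are concatenated. The shapes and random numbers here are illustrative, not the video's values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    """Run several attention heads independently and concatenate the results."""
    outs = []
    for Wq, Wk, Wv in heads:  # each head has its own query/key/value weights
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1)  # typically followed by one more linear layer

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 2))  # 3 tokens, 2-d embeddings
heads = [tuple(rng.normal(size=(2, 1)) for _ in range(3)) for _ in range(2)]
print(multi_head_attention(X, heads).shape)  # (3, 2): two 1-d heads concatenated
```

Since the heads share no weights, each one is free to learn a different notion of which tokens should attend to which.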
Why did they use the square root of d_k? Why not just d_k? If anyone knows the answer, please give a good explanation.
To quote from the original manuscript, if the components of q and k are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Thus, dividing the dot products by the square root of d_k results in variance = 1. That said, unfortunately, as you can see in this illustration, the variance for q and k is much higher than 1, so the theory doesn't actually hold.
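That variance argument is easy to check numerically. Under the paper's idealized assumption (components drawn from a standard normal, which real trained q and k generally are not), the raw dot products have variance close to d_k, and dividing by sqrt(d_k) brings it back near 1:

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n = 100_000

# n independent query/key vectors with mean 0, variance 1 per component
q = rng.normal(0, 1, size=(n, d_k))
k = rng.normal(0, 1, size=(n, d_k))

dots = (q * k).sum(axis=1)                 # n independent dot products
print(dots.var())                          # close to d_k = 64
print((dots / np.sqrt(d_k)).var())         # close to 1 after scaling
```

Keeping the pre-softmax scores near unit variance matters because large-magnitude scores push the softmax into regions with tiny gradients.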
Cody finished his story😅
One more to go - in the next video we'll code this thing up in PyTorch.
the A B C thing... i think it is inspired by Sesame Street LoL!!!!!!! 🙂
:)
🎉
:)
Kolmogorov-Arnold Networks videoooooo mr bam
I'll keep that in mind.
You made me love data science. If not for you, I would have learned like a zombie.
bam!
Thanks for the great contents! One minor thing - at 5:24 minute, the first element of the Query weight matrix should be 2.22, but not 0.22
Yep. That's a typo.
It would be nice if you develop courses of Object Detection, mainly YOLO
I'll keep that in mind.
With all due respect, please stop singing at the beginning of your videos. Having that at the beginning of every video is very irritating.
Noted