The matrix math behind transformer neural networks, one step at a time!!!

  • Published May 31, 2024
  • Transformers, the neural network architecture behind ChatGPT, do a lot of math, and that math can be done quickly as matrix math because GPUs are optimized for it. Matrix math is also what we use when we code neural networks, so learning how ChatGPT does it will help you code your own. Thus, in this video, we go through the math one step at a time and explain what each step does so that you can use it on your own with confidence. (A minimal code sketch of the self-attention matrix math appears after the chapter list below.)
    NOTE: This StatQuest assumes that you are already familiar with:
    Transformers: • Transformer Neural Net...
    The essential matrix algebra for neural networks: • Decoder-Only Transform...
    If you'd like to support StatQuest, please consider...
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
    statquest.org/statquest-store/
    ...or just donating to StatQuest!
    paypal: www.paypal.me/statquest
    venmo: @JoshStarmer
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    1:43 Word Embedding
    3:37 Position Encoding
    4:28 Self Attention
    12:09 Residual Connections
    13:08 Decoder Word Embedding and Position Encoding
    15:33 Masked Self Attention
    20:18 Encoder-Decoder Attention
    21:31 Fully Connected Layer
    22:16 SoftMax
    #StatQuest #Transformer #ChatGPT
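
    A minimal PyTorch sketch of the scaled dot-product self-attention matrix math the video walks through (the input encodings and weight matrices below are illustrative values, not the ones used in the video):

    import torch
    import torch.nn.functional as F

    # Toy input: 3 tokens, each encoded in 2 dimensions (word embedding + position encoding).
    encodings = torch.tensor([[1.16, 0.23],
                              [0.57, 1.36],
                              [4.41, -2.16]])

    # Illustrative query, key, and value weights (in a real model these are learned).
    W_q = torch.tensor([[0.54, -0.17], [0.33, 0.48]])
    W_k = torch.tensor([[0.71, 0.22], [-0.45, 0.98]])
    W_v = torch.tensor([[0.79, -0.75], [0.12, 0.34]])

    # One matrix multiplication per projection computes Q, K, and V for every token at once.
    Q = encodings @ W_q
    K = encodings @ W_k
    V = encodings @ W_v

    # Scaled dot-product attention: similarity scores -> softmax -> weighted sum of the values.
    scores = Q @ K.T / (K.size(1) ** 0.5)
    attention = F.softmax(scores, dim=-1) @ V
    print(attention)  # one attention output per input token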

COMMENTS • 88

  • @samglick8479
    @samglick8479 1 month ago +10

    Josh Starmer is the GOAT. Literally every morning I wake up with some statquest, and it really helps me get ready for my statistics classes for the day. Thank you Josh!

  • @NJCLM
    @NJCLM 12 days ago +1

    Very educational, and also innovative in the way it's done. I have never seen teaching like this anywhere else. You are the BEST!

  • @colekeircom
    @colekeircom 1 month ago +1

    As an electronics hobbyist/student from way back in the 70s, I like to keep up with technology as best I can. I'm really glad I don't have to remember all the details in this series. There are so many layers upon layers that at times I do "just keep going to the end" of the videos. Nevertheless, I still manage to learn key aspects and new terms from your excellent teaching. There must be an incredible amount of work involved in creating these lessons.
    I will purchase your book because you deserve some form of appreciation, and it'll serve as a great reference resource. Much respect Josh and thanks, Kieron.

    • @statquest
      @statquest 1 month ago +1

      Thank you very much!

  • @jordantran3102
    @jordantran3102 1 month ago +3

    You weren't kidding, it's here! You're a man of your word and a man of the people.

  • @roberto2912
    @roberto2912 1 month ago +3

    Josh! Thanks for this video; it has been easier for me to follow the matrix representation of the computation than the arrows used in the previous videos. I really appreciate your explanation using matrices!

    • @statquest
      @statquest 1 month ago

      Glad it was helpful!

  • @mraarone
    @mraarone 1 month ago +5

    DUDE JOSH, FINALLY! I have been waiting for this episode for a year or more. I’m so proud of you bro. You got there!

  • @TheCJD89
    @TheCJD89 1 month ago +1

    This is really good. The simple example you used was very effective for demonstrating the inner workings of the transformer.

    • @statquest
      @statquest 1 month ago

      Thank you very much!

  • @roro5179
    @roro5179 1 month ago +1

    Always been a huge fan of the channel, and at this point in my life this video really couldn't have come at a better time. Thanks for helping us viewers with some of the best content on the planet (I said what I said)!

  • @Aa-fk8jg
    @Aa-fk8jg 18 days ago +1

    StatQuest is the best thing I ever found on the internet

  • @MakeDataUseful
    @MakeDataUseful 1 month ago +1

    Amazing, thank you Josh. You deserve millions more subscribers

  • @NewsLetter-sq1eh
    @NewsLetter-sq1eh 1 month ago +2

    Your videos are a didactic stroke of genius! 👍

  • @liuwingki413
    @liuwingki413 26 days ago +1

    Thanks for introducing the concepts behind transformers

  • @itsawonderfullife4802
    @itsawonderfullife4802 1 month ago +1

    Wow, 'Squatch! Long time no see, my friend! Good to see you.
    Your videos are so much fun that it doesn't feel like we're actually in class. Thank you Josh.

  • @adityabhosale7838
    @adityabhosale7838 1 month ago +2

    Please add this video to your Neural Networks playlist. I recently started watching that playlist.

  • @Hakilia
    @Hakilia 1 month ago +1

    following you from 🇨🇩

  • @statquest
    @statquest 1 month ago +2

    The full Neural Networks playlist, from the basics to deep learning, is here: ua-cam.com/video/CqOfi41LfDw/v-deo.html
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @rickymort135
      @rickymort135 1 month ago

      Just ordered your book 😊 Thanks for the love and care you put into this

  • @pulse6982
    @pulse6982 1 month ago +1

    Doing god's work, Josh!

  • @Er1kth3b00s
    @Er1kth3b00s 1 month ago

    Amazing video! Can't wait for the next one. By the way, I think there's a small typo at 5:15 where the first query weight in the matrix notation should be 2.22 instead of 0.22

    • @statquest
      @statquest 1 month ago

      Oops! Thanks for catching that!

  • @kartikchaturvedi7868
    @kartikchaturvedi7868 1 month ago +1

    Superrrb Awesome Fantastic video

  • @BlayneOliver
    @BlayneOliver 1 month ago

    Josh, do you know how to use embedding layers to add context to a regression model?
    And do you offer 1-on-1 guidance? I'm stuck on a problem regarding this video's topic.

    • @statquest
      @statquest 1 month ago

      Hmmm...I'm not sure about the first question and, unfortunately, I don't offer one-on-one guidance.

  • @swarnavasarkar8106
    @swarnavasarkar8106 1 month ago

    Hey... did you cover the training steps in this video? Sorry if I missed it.

    • @statquest
      @statquest 1 month ago +2

      No, just how the math is done when training. We'll cover more details of training in my next video when we learn how to code transformers in PyTorch.

  • @loflog
    @loflog 1 month ago

    Question: If all tokens can be calculated in parallel, then why is time-to-first-token such an important metric for model performance?

    • @statquest
      @statquest 1 month ago

      That might be related to decoding, which, during inference, is sequential.

    • @kamiltylus
      @kamiltylus 1 month ago +1

      The time to first token may refer to the decoder producing its first token in the autoregressive setting, where (for example, in sentence translation) the model produces one token at a time and then feeds it back into itself to generate the next one, and so on. That process is sequential, while the computation of all the matrices (over the already-existing embeddings) is parallel.
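
      To make that distinction concrete, here is a minimal greedy-decoding sketch in PyTorch; TinyLM is a hypothetical stand-in for a trained decoder, not code from the video. Each forward pass can process the whole current sequence at once, but each new token still has to wait for the previous one.

      import torch
      import torch.nn as nn

      class TinyLM(nn.Module):
          """Hypothetical stand-in for a trained decoder-only model."""
          def __init__(self, vocab_size=10, d_model=8):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, d_model)
              self.out = nn.Linear(d_model, vocab_size)

          def forward(self, token_ids):
              # A real decoder would compute attention over all of token_ids in parallel here;
              # this toy model just embeds each token and projects it to logits.
              return self.out(self.embed(token_ids))  # (seq_len, vocab_size)

      model = TinyLM()
      tokens = torch.tensor([1, 4, 2])  # prompt token ids

      # Generation is sequential: each new token depends on the previous output,
      # which is why time-to-first-token and per-token latency matter at inference time.
      for _ in range(5):
          logits = model(tokens)            # one parallel pass over the current sequence
          next_token = logits[-1].argmax()  # greedy choice of the next token
          tokens = torch.cat([tokens, next_token.unsqueeze(0)])
      print(tokens)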

  • @theneumann7
    @theneumann7 1 month ago +1

    perfection

  • @kavinvignesh2832
    @kavinvignesh2832 1 month ago +2

    TRIPLE BAM!!!!!!!!

  • @DmitryPesegov
    @DmitryPesegov 1 month ago

    Great details. But, please: in teaching it's very important to use some imaginable concept as a framework, and for me it's hard to connect all these numbers to the goal and to why it works. Consider using the concept of an n-sphere (here it can just be a 2D circle, since we use 2 values per token): multiplying Q and K by the weight matrices Wq and Wk effectively rotates those vectors (a linear transformation can do more, but we then measure how aligned the vectors are, as with cosine similarity in [-1..1], except divided not by the product of the 2-norms but by the square root of the dimensionality, for computational performance, which you mentioned). And when we multiply by V, we mix the values in each dimension according to how well the corresponding Q and K vectors align. Rotations, alignment, mixing. Repeat.
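
    As a small illustration of the comparison drawn here (an added sketch, not from the video or the comment): cosine similarity normalizes the q·k dot product by the vector norms, while the attention score divides the same dot product by sqrt(d_k).

    import torch
    import torch.nn.functional as F

    q = torch.tensor([1.5, -0.4])
    k = torch.tensor([0.9, 0.7])
    d_k = q.numel()

    cosine = F.cosine_similarity(q, k, dim=0)  # dot product / (||q|| * ||k||), always in [-1, 1]
    scaled = torch.dot(q, k) / d_k ** 0.5      # attention score: same dot product / sqrt(d_k)

    print(cosine.item(), scaled.item())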

  • @farazsyed.2898
    @farazsyed.2898 1 month ago

    need a video on degrees of freedom!!!

  • @yuvalalmog6000
    @yuvalalmog6000 5 days ago

    Will you ever make videos on the subjects of Reinforcement learning, NLP or generative models?

    • @statquest
      @statquest 5 days ago

      I think you could argue that this video is about NLP and also covers a generative model, and I'll keep the other topic in mind.

    • @yuvalalmog6000
      @yuvalalmog6000 4 days ago +1

      @@statquest I'll explain myself better, as I admit I phrased it poorly. For deep learning and machine learning you made amazing videos that covered those subjects from the basics to advanced topics, essentially teaching the whole subject in a fun, creative & enjoyable sequence of videos that can help beginners learn it from top to bottom.
      However, for NLP, for example, you did talk about specific subjects like word embedding or auto-translation, but there are other topics (mostly older ones) in that field that are important to learn, such as n-grams & HMMs.
      So my question was not only about specific advanced topics that connect to others, but rather about a full course that covers the basics of the subject as well.
      Sorry for my bad phrasing, and thank you both for your quick answer and amazing videos! 😄

    • @statquest
      @statquest 4 days ago +1

      @@yuvalalmog6000 I hope to one day cover HMMs.

  • @gui-zx3di
    @gui-zx3di 1 month ago

    Usually "vamos" will not be one token but two. How can the algorithm handle this division?

    • @statquest
      @statquest 1 month ago

      You could split "vamos" into two tokens, "va" and "mos"; then the output from the decoder would be "va", "mos", "<EOS>".
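
      For illustration, a minimal greedy longest-match subword tokenizer along these lines (the vocabulary below is made up; real tokenizers such as BPE learn their subwords from data):

      # Hypothetical subword vocabulary, not the one used in the video.
      vocab = {"va", "mos", "what", "is", "statquest", "<EOS>"}

      def tokenize(word, vocab):
          """Greedily split a word into the longest subwords found in the vocabulary."""
          tokens = []
          while word:
              for end in range(len(word), 0, -1):
                  if word[:end] in vocab:
                      tokens.append(word[:end])
                      word = word[end:]
                      break
              else:
                  raise ValueError(f"cannot tokenize {word!r}")
          return tokens

      print(tokenize("vamos", vocab))  # ['va', 'mos']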

  • @nivcohen961
    @nivcohen961 1 month ago +2

    Goat

  • @I.II..III...IIIII.....
    @I.II..III...IIIII..... 1 month ago

    10:51 How come each token's maximum similarity isn't with itself?

    • @statquest
      @statquest 1 month ago

      This example, trained on just 2 phrases ("what is statquest?" and "statquest is what"), is too simple to really show off the nuance in how these things work.

    • @I.II..III...IIIII.....
      @I.II..III...IIIII..... 1 month ago

      @@statquest Ah, so with more training and a bigger dataset we can expect the weights to give values closer to what we intuitively expect, like, as I said, each word having the biggest similarity with itself? Great video for seeing the matrices in action, and I like the content and don't want to be rude, but I think touching on such details a bit would've been nice. Also, maybe something on multi-head attention?

    • @statquest
      @statquest 1 month ago

      @@I.II..III...IIIII..... I believe that is correct. And I'll talk about multi-head attention more in my video on how to code transformers.

  • @user-yc9do4mb5i
    @user-yc9do4mb5i 1 month ago

    Why did they use the square root of d_k? Why not just d_k? ... If anyone knows the answer, please give a good explanation.

    • @statquest
      @statquest 1 month ago

      To paraphrase the original manuscript: if the components of q and k are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Thus, dividing the dot products by the square root of d_k results in variance = 1. That said, unfortunately, as you can see in this illustration, the variance for q and k is much higher than 1, so the theory doesn't actually hold.
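
      As a quick numerical check of that variance argument (a sketch added here for illustration), sample vectors whose components have mean 0 and variance 1 and compare the variance of the raw and scaled dot products:

      import torch

      d_k = 64
      n = 100_000

      # Components of q and k drawn i.i.d. with mean 0 and variance 1.
      q = torch.randn(n, d_k)
      k = torch.randn(n, d_k)

      raw = (q * k).sum(dim=1)   # dot products: variance is roughly d_k
      scaled = raw / d_k ** 0.5  # dividing by sqrt(d_k) brings the variance back to roughly 1

      print(raw.var().item(), scaled.var().item())  # ~64 and ~1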

  • @faisalsheikh7846
    @faisalsheikh7846 1 month ago +1

    Cody finished his story😅

    • @statquest
      @statquest 1 month ago

      One more to go - in the next video we'll code this thing up in PyTorch.

  • @EzraSchroeder
    @EzraSchroeder 1 month ago

    the A B C thing... i think it is inspired by Sesame Street LoL!!!!!!! 🙂

  • @wilfredomartel7781
    @wilfredomartel7781 1 month ago +1

    🎉

  • @juansilva-fy6cw
    @juansilva-fy6cw 23 days ago

    Kolmogorov-Arnold Networks videoooooo mr bam

    • @statquest
      @statquest 23 days ago +1

      I'll keep that in mind.

  • @nivcohen961
    @nivcohen961 1 month ago +2

    You made me love data science; if not for you, I would be learning like a zombie.

  • @Keshi-lz3ef
    @Keshi-lz3ef 1 month ago +1

    Thanks for the great content! One minor thing - at 5:24, the first element of the Query weight matrix should be 2.22, not 0.22.

  • @felipela2227
    @felipela2227 20 days ago

    It would be nice if you developed courses on object detection, mainly YOLO.

    • @statquest
      @statquest 19 days ago +1

      I'll keep that in mind.

  • @DarkNight0411
    @DarkNight0411 1 month ago

    With all due respect, please stop singing at the beginning of your videos. Having that at the beginning of every video is very irritating.