Neural Attention - This simple example will change how you think about it

  • Published 23 Dec 2024

COMMENTS • 40

  • @avb_fj
    @avb_fj  1 year ago +1

    Just uploaded the second part of the series discussing Self Attention and variants. Link here:
    ua-cam.com/video/4naXLhVfeho/v-deo.html
    Here's future me posting the third part about Transformers:
    ua-cam.com/video/0P6-6KhBmZM/v-deo.html

  • @iSpades0
    @iSpades0 1 year ago +6

    By far one of the best Deep Learning UA-cam channels I have ever checked out!
    Can't wait for parts 2 and 3, keep up the good work!

    • @avb_fj
      @avb_fj  1 year ago +1

      Thanks a lot! Nice timing with the comment, I just published the next video a couple hours back! 😀

  • @amoghjain
    @amoghjain 1 year ago +3

    please keep making these videos!! your explanations are absolutely amazing, engaging, to the point, intuitive, and very easy to understand!!!

  • @kozer1986
    @kozer1986 1 year ago +6

    The best explanation I've ever seen. It totally clicked for me! Thanks!!

    • @avb_fj
      @avb_fj  1 year ago

      Awesome! Glad to hear it! :)

  • @svenleijnen9045
    @svenleijnen9045 1 year ago +1

    First time ever I comment a video but I just had to: the way you make complex concepts understandable is awesome! Best explanation about attention I’ve come across so far 👍

    • @avb_fj
      @avb_fj  1 year ago

      Nice, I am a fellow non-commenter as well! Glad to see you here, and thanks for all the appreciation!

  • @LexPodgorny
    @LexPodgorny 8 months ago

    @2:14 You've suddenly jumped from a vector of 512 to a vector of 2. But how? Please explain what happened there, because I think a key portion of the video got cut out. Thanks

    • @avb_fj
      @avb_fj  8 months ago +1

      As I mentioned around 3:06, the 2D thing was an example. The Q/K/V embedding size is an arbitrary hyperparameter so it can be set to anything. I used the size 2 example just to illustrate how the "dot product" works since it is easy to show the cosine similarity between two 2D vectors as in 2:51.
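      The 2D dot-product example discussed in this reply can be sketched in numpy (the specific vectors below are made up for illustration, not taken from the video):

      ```python
      import numpy as np

      # Hypothetical 2D query and key vectors, like the size-2 example in the video.
      q = np.array([0.8, 0.6])
      k_similar = np.array([0.9, 0.5])     # points in roughly the same direction as q
      k_different = np.array([-0.6, 0.8])  # orthogonal to q

      def cosine_similarity(a, b):
          """Dot product normalized by magnitudes: 1 = same direction, 0 = orthogonal."""
          return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

      print(cosine_similarity(q, k_similar))    # close to 1
      print(cosine_similarity(q, k_different))  # exactly 0
      ```

      The same dot product works unchanged for 512-dimensional vectors; 2D is just easy to draw.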

    • @LexPodgorny
      @LexPodgorny 8 months ago

      @@avb_fj Ah, I got it now. Thank you!
      But how do you actually produce a query embedding from a query? Is there a video on building the key and query neural networks that do that? Especially interesting is the part where the query embedding is learned in a way that corresponds to the key embedding vector coordinates. I am assuming using the same word embedding for both should take care of it somehow, but it would be great to see the actual technique that is used. Thank you!

    • @avb_fj
      @avb_fj  8 months ago

      So an embedding can be obtained by passing your input through a neural network. For the case of text, we can use anything from "word embeddings" to "RNNs/LSTMs", etc., to convert input text into embeddings. In my channel there are a couple of helpful videos:
      ua-cam.com/video/0P6-6KhBmZM/v-deo.html
      ua-cam.com/video/uocYQH0cWTs/v-deo.html
      But there are plenty of resources online too! Good luck!

  • @sahhaf1234
    @sahhaf1234 11 months ago

    These videos are prepared in a very thought-provoking way...
    I think the weights/biases of the system are in the query, key, and value networks given at @15:30 and all training occurs there. Therefore can we say that what neural attention learns are embeddings? On the other hand, the part softmax(QK^T/sqrt(d_k))V is fixed, and apparently it does not learn anything during training.
    Thank you very much again for these very well prepared videos.

    • @avb_fj
      @avb_fj  11 months ago +1

      The Query, Keys, and Values are indeed embeddings that must be optimized, by updating the Query, Key, and Value neural networks. The softmax(QK^T/sqrt(d_k))V part is the "computation graph" that inputs the embeddings and transforms them into new "contextually aware embeddings".
      Consider the below example:
      Say: You have 2 inputs a and b. And you want to train a neural net to predict C. You can model your network as Y = F(a) + G(b). That is, we are saying: "some function of a and some function of b will add up to be Y." We will optimize the functions F and G such that Y gets close to C.
      1) F and G are analogous to the Q, K, V neural networks.
      2) F(a) and G(b) are analogous to Q, K, V embeddings
      3) The + sign is the computation graph that combines F(a) and G(b) to make a prediction Y. In the attention formula, this + sign is equivalent to softmax(Q . K^T) V.
      4) Finally, since we wanted to predict C, not Y, we calculate the loss between C and Y, and then optimize the weights/biases of the neural networks F and G such that Y (our prediction) gets closer to C (our target). The gradients of the loss flow right through the + operation and through F and G.
      So yeah, the input collections are passed through the Q, K, V networks to derive the Q, K, V embeddings. The softmax(...) portion is the computation that combines these embeddings. The softmax(...) operation doesn't contain any parameters to train, as you mentioned, but it forms the backbone/computation graph of how the forward pass and backward propagation work. Hope that helps.
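      The reply above can be sketched as a tiny scaled dot-product attention forward pass in numpy. The shapes and random weights are arbitrary illustrations; only the W matrices (playing the role of F and G) would be trained, while the softmax(QK^T/sqrt(d_k))V part is a fixed computation graph:

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          e = np.exp(x - x.max(axis=axis, keepdims=True))
          return e / e.sum(axis=axis, keepdims=True)

      rng = np.random.default_rng(0)
      n, d_k = 4, 8                      # 4 tokens, embedding size 8 (arbitrary hyperparameter)
      X = rng.normal(size=(n, d_k))      # input embeddings

      # The trainable parameters: the Query/Key/Value networks (here, single linear maps).
      Wq, Wk, Wv = (rng.normal(size=(d_k, d_k)) for _ in range(3))
      Q, K, V = X @ Wq, X @ Wk, X @ Wv   # the Q/K/V embeddings

      # The fixed, parameter-free computation graph: softmax(QK^T / sqrt(d_k)) V
      weights = softmax(Q @ K.T / np.sqrt(d_k))  # (4, 4) attention weights
      out = weights @ V                  # "contextually aware embeddings", shape (4, 8)
      ```

      Gradients of a downstream loss would flow through the softmax computation back into Wq, Wk, and Wv, exactly like the + sign in the F/G analogy.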

    • @sahhaf1234
      @sahhaf1234 11 months ago

      @@avb_fj Thanks a lot. Right now I'm listening to the self-attention part. When I'm done with the third part, I'll come back here and read your reply again more carefully.

  • @sahhaf1234
    @sahhaf1234 11 months ago

    Excellent video. My only critique is that the concept of a hidden state is used around @13:15 without being defined. After @13:13, it becomes a little too fast and the concepts become a blur.

    • @avb_fj
      @avb_fj  11 months ago +1

      Appreciate all the feedback! Thanks for sharing your experience... I thought getting into LSTMs and RNNs would be a bit of a rabbit hole for this video, since not all of it is relevant to the primary topic, so I stayed at the surface level with the hidden state stuff and focused more on the "attention" portions of the video.

  • @luisfelipearaujodeoliveira469
    @luisfelipearaujodeoliveira469 8 months ago +1

    AMAZING TUTORIAL, I am definitely using your video as a recommendation to all my friends that want to learn Deep Learning in an easy way. Greeting from Brazil! And keep up the good work!

    • @avb_fj
      @avb_fj  8 months ago

      Thanks!! Totally made my day!

  • @noclaf78
    @noclaf78 6 months ago

    Is there a link to your contrastive learning video?

    • @avb_fj
      @avb_fj  6 months ago

      Check out the first quarter of this video: Multimodal AI from First Principles - Neural Nets that can see, hear, AND write.
      ua-cam.com/video/-llkMpNH160/v-deo.html

  • @gnorts_mr_alien
    @gnorts_mr_alien 1 year ago +1

    you will be a star teacher on youtube if you keep it up (and if that is your goal). thank you, this was very good; subscribed.

    • @avb_fj
      @avb_fj  1 year ago

      Wow, that’s got to be one of the kindest comments I’ve ever received! Thanks a lot… glad you enjoyed it!

    • @gnorts_mr_alien
      @gnorts_mr_alien 1 year ago +1

      you definitely have that special "knack" some of the best teachers have, and a very soothing tone to boot. eagerly waiting for parts 2 and 3. cheers! @@avb_fj

  • @sahhaf1234
    @sahhaf1234 11 months ago

    I watched the whole series and it is a real gem. But something is missing. Where are the nonlinearity and the weights and biases? What do we train?

    • @avb_fj
      @avb_fj  11 months ago

      The weights, biases, and non-linearity come from the Query, Key, and Value neural networks. These convert the input embeddings into the query, key, value embeddings respectively - which then go through the attention computation.
      We can also add additional feed-forward layers after the attention layer to add additional transformations/non-linearity.
      Other weights we train can be the initial embeddings of the input collection. Look up word embeddings, for example, which train special embedding vectors for each word in the vocabulary. There can also be separate neural networks to embed each input type. For example, suppose you are trying to learn attention between a bunch of images and a sentence. The images can have their own image encoding neural network, and the sentence can have a text encoding neural network. All of these nets have their own weights and biases according to whatever the end goal is. Once the forward pass is defined, we compute the loss between the network prediction and the target. Through backpropagation, all learnable parameters then get updated.
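      As a minimal sketch of where the weights, biases, and nonlinearity live, here is a hypothetical one-hidden-layer "Query network" in numpy (the layer sizes and the ReLU choice are assumptions for illustration; the Key and Value networks would look the same):

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      d_in, d_hidden, d_k = 16, 32, 8

      # Trainable parameters of the Query network: two weight matrices and two biases.
      W1, b1 = rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden)
      W2, b2 = rng.normal(size=(d_hidden, d_k)), np.zeros(d_k)

      def query_net(x):
          h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer: the nonlinearity
          return h @ W2 + b2                # query embeddings of size d_k

      x = rng.normal(size=(5, d_in))        # 5 input embeddings
      Q = query_net(x)
      print(Q.shape)  # (5, 8)
      ```

      During training, gradients from the loss would update W1, b1, W2, and b2 (and likewise the Key/Value network parameters), even though the attention formula itself has no parameters.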

  • @adrianjackson1045
    @adrianjackson1045 10 months ago

    great video!! your explanations and graphics are amazing. love the content

    • @avb_fj
      @avb_fj  10 months ago

      Thanks!😊

  • @venkateshbs1384
    @venkateshbs1384 9 months ago

    Very Clear Explanation. Thanks for that.

  • @sahhaf1234
    @sahhaf1234 11 months ago

    Maybe an unimportant point, but @2:00 the vector Q looks like a column vector. I think it should be a row vector.

    • @avb_fj
      @avb_fj  11 months ago +1

      Yeah, a row vector would be more accurate for the QK^T stuff that happens later. Thanks for pointing that out.

    • @w花b
      @w花b 5 days ago +1

      I was so confused by the dimensions that I was questioning whether I had learned matrices right lol. I knew something didn't match. But in numpy they're columns by default, so we would also need the transpose of Q as well.
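      A quick numpy sketch of the row-vector convention being discussed (the shapes are arbitrary illustrations): storing the query as an explicit row vector makes q @ K.T produce one score per key with no extra transposes:

      ```python
      import numpy as np

      d_k = 4
      q = np.ones((1, d_k))     # query as an explicit row vector, shape (1, 4)
      K = np.ones((3, d_k))     # 3 keys stacked as rows, shape (3, 4)

      scores = q @ K.T          # (1, 4) @ (4, 3) -> (1, 3): one score per key
      print(scores.shape)  # (1, 3)
      ```

      (Note that a 1-D numpy array has no row/column orientation at all; the explicit (1, d_k) shape is what makes the matrix algebra line up with the QK^T notation.)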

  • @serta5727
    @serta5727 1 year ago +1

    Very good explanation, thanks 😄

  • @landerosedgard
    @landerosedgard 1 year ago

    great explanation!

  • @sahhaf1234
    @sahhaf1234 8 months ago

    I think the embeddings must be normalized for the dot product to make sense.

  • @repairstudio4940
    @repairstudio4940 1 year ago +2

    Is that a One Piece shirt?! Love it man! 🎉
    Also great content, as always.

    • @avb_fj
      @avb_fj  1 year ago

      Haha thanks!🙏🏽

  • @ahnafsamin3777
    @ahnafsamin3777 9 months ago

    Good videos but a bit fast

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 7 months ago

    Public service announcement: this overview is unintelligible as is. Sorry, going to have to downvote this; too many completely unexplained abstractions and jargon, and glossing over the key details that are absolutely required for anyone to grasp 'attention' mechanisms. E.g. you don't even define 'mean pooling', etc. (this isn't even a common ML term; I've written a transformer and have never heard this jargon used.)