The Attention Mechanism in Large Language Models

  • Published Jan 3, 2025
  • Attention mechanisms are crucial to the huge boom LLMs have recently had.
    In this video you'll see a friendly pictorial explanation of how attention mechanisms work in Large Language Models.
    This is the first of a series of three videos on Transformer models.
    Video 1: The attention mechanism at a high level (this one)
    Video 2: The attention mechanism with math: • The math behind Attent...
    Video 3: Transformer models • What are Transformer M...
    Learn more in LLM University! llm.university

COMMENTS •

  • @arvindkumarsoundarrajan9479
    @arvindkumarsoundarrajan9479 11 months ago +56

    I have been reading the "Attention Is All You Need" paper for like 2 years and never understood it properly until now😮. I'm so happy now🎉

  • @drdr3496
    @drdr3496 10 months ago +3

    This is a great video (as are the other 2) but one thing that needs to be clarified is that the embeddings themselves do not change (by attention @10:49). The gravity pull analogy is appropriate but the visuals give the impression that embedding weights change. What changes is the context vector.
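The distinction this comment draws can be sketched in a few lines of NumPy (made-up toy values, a single attention step, no learned weight matrices): the embedding matrix is only read, and attention's output is a separate set of context vectors.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for a 3-word sentence (illustrative values).
E = np.array([[1.0, 0.0, 0.5, 0.2],   # "buy"
              [0.3, 0.9, 0.1, 0.7],   # "apple"
              [0.2, 0.8, 0.4, 0.6]])  # "orange"

scores = E @ E.T                                                      # word-to-word similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
context = weights @ E                                                 # context vectors: weighted sums

# The embeddings themselves never change; only the context vectors are new.
assert np.array_equal(E[1], np.array([0.3, 0.9, 0.1, 0.7]))
```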

  • @RG-ik5kw
    @RG-ik5kw 1 year ago +39

    Your videos in the LLM uni are incredible. Builds up true understanding after watching tons of other material that was all a bit loose on the ends. Thank you!

  • @GrahamAnderson-z7x
    @GrahamAnderson-z7x 8 months ago +5

    I love your clear, non-intimidating, and visual teaching style.

    • @SerranoAcademy
      @SerranoAcademy  8 months ago +1

      Thank you so much for your kind words and your kind contribution! It’s really appreciated!

  • @malikkissoum730
    @malikkissoum730 1 year ago +16

    Best teacher on the internet, thank you for your amazing work and the time you took to put those videos together

  • @MrProgrammer-yr1ed
    @MrProgrammer-yr1ed 1 month ago +2

    This video is amazing!
    I appreciate Luis for his skill in explaining PhD-level concepts so simply that a 9th-grade student can understand them.
    I found this channel is a diamond mine for beginners.
    Thanks Luis.

  • @gunjanmimo
    @gunjanmimo 1 year ago +9

    This is one of the best videos on YouTube to understand ATTENTION. Thank you for creating such outstanding content. I am waiting for the upcoming videos of this series. Thank you ❤

  • @saeed577
    @saeed577 10 months ago +3

    THE best explanation of this concept. That was genuinely amazing.

  • @Compsci-v6q
    @Compsci-v6q 3 months ago +2

    This channel is underrated; your explanations are the best among all the channels I've watched.

  • @bobae1357
    @bobae1357 10 months ago +4

    Best description ever! Easy to understand. I've been struggling to understand attention. Finally I can say I know it!

  • @JyuSub
    @JyuSub 10 months ago +3

    Just THANK YOU. This is by far the best video on the attention mechanism for people that learn visually

  • @EricMutta
    @EricMutta 1 year ago +19

    Truly amazing video! The published papers never bother to explain things with this level of clarity and simplicity, which is a shame because if more people outside the field understood what is going on, we may have gotten something like ChatGPT about 10 years sooner! Thanks for taking the time to make this - the visual presentation with the little animations makes a HUGE difference!

  • @FawadMahdi-o2h
    @FawadMahdi-o2h 3 months ago +1

    This was hands down the best explanation I've seen of attention mechanisms and multi head attention --- the fact I'm able to use these words in this sentence means I understand it

  • @k.i.a7240
    @k.i.a7240 13 days ago

    The world needs people like Serrano more, who explain the shit out of ambiguities and lead us back to the age of wisdom.

  • @ronitakhariya4094
    @ronitakhariya4094 1 month ago

    absolutely loved the last part with explaining linear transformations of query key and values. thank you so much!

  • @anipacify1163
    @anipacify1163 10 months ago +1

    Omg this video is on a whole new level. This is probably the best intuition behind transformers and attention, and the best way to understand them. I went through a couple of videos online and finally found the best one. Thanks a lot! It helped me understand the paper easily.

  • @calum.macleod
    @calum.macleod 1 year ago +11

    I appreciate your videos, especially how you can apply a good perspective to understand the high level concepts, before getting too deep into the maths.

  • @TheMircus224
    @TheMircus224 1 year ago +1

    These videos where you explain the transformers are excellent. I have gone through a lot of material however, it is your videos that have allowed me to understand the intuition behind these models. Thank you very much!

  • @Aidin-f5v
    @Aidin-f5v 1 month ago +1

    That was awesome, thank you.
    You saved me a lot of time reading and watching nonsense videos and texts.

  • @mohameddjilani4109
    @mohameddjilani4109 1 year ago +1

    I really enjoyed how you give a clear explanation of the operations and the representations used in attention

  • @apah
    @apah 1 year ago +4

    So glad to see you're still active, Luis! You and StatQuest's Josh Starmer really are the backbone of more ML professionals than you can imagine

  • @aadeshingle7593
    @aadeshingle7593 1 year ago +4

    One of the best intuitions for understanding multi-head attention. Thanks a lot!❣

  • @nealdavar939
    @nealdavar939 8 months ago +1

    The way you break down these concepts is insane. Thank you

  • @ccgarciab
    @ccgarciab 10 months ago +2

    This is such a good, clear and concise video. Great job!

  • @amoghjain
    @amoghjain 1 year ago +2

    Thank you for making this video series for the sake of a learner and not to show off your own knowledge!! Great anecdotes and simple examples really helped me understand the key concepts!!

  • @pruthvipatel8720
    @pruthvipatel8720 1 year ago +7

    I always struggled with KQV in attention paper. Thanks a lot for this crystal clear explanation!
    Eagerly looking forward to the next videos on this topic.

  • @sayamkumar7276
    @sayamkumar7276 1 year ago +10

    This is one of the clearest, simplest and most intuitive explanations of the attention mechanism. Thanks for making such a tedious and challenging concept relatively easy to understand 👏 Looking forward to the remaining 2 videos of this series on attention

  • @decryptifi2265
    @decryptifi2265 1 month ago

    What a beautiful way of explaining the attention mechanism. Great job, Serrano!

  • @ajnbin
    @ajnbin 1 year ago +1

    Fantastic !!! The explanation itself is a piece of art.
    The step by step approach, the abstractions, ... Kudos!!
    Please more of these

  • @arulbalasubramanian9474
    @arulbalasubramanian9474 1 year ago +1

    Great explanation. After watching a handful of videos, this one really makes it easy to understand.

  • @docodemo727
    @docodemo727 1 year ago +1

    This video really teaches you the intuition, much better than the others I went through that just throw formulas at you. Thanks for the great job!

  • @guru7856
    @guru7856 3 months ago

    Thank you for your explanation! I've always wondered why the attention mechanism in Transformers produces more effective embeddings compared to Word2Vec, and your video clarified this well. Word2Vec generates static embeddings, meaning that a word always has the same representation, regardless of the context in which it appears. In contrast, Transformers create context-dependent embeddings, where the representation of a word is influenced by the words around it. This dynamic approach is what makes Transformer embeddings so powerful.
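The contrast this comment describes can be illustrated with a toy sketch (hypothetical 2-D vectors, with a bare similarity-weighted average standing in for full attention): the static table gives "apple" one fixed vector, while the contextual function returns a different vector depending on the sentence.

```python
import numpy as np

# Static, Word2Vec-style lookup: one fixed vector per word (made-up 2-D values,
# axis 0 ~ "fruit-ness", axis 1 ~ "tech-ness").
static = {
    "apple":  np.array([0.5, 0.5]),   # ambiguous on its own
    "orange": np.array([0.9, 0.1]),
    "iphone": np.array([0.1, 0.9]),
}

def contextual(tokens):
    """Toy self-attention: each output vector is a similarity-weighted average
    of all the embeddings in the sentence, so it shifts with the context."""
    X = np.stack([static[t] for t in tokens])
    s = X @ X.T
    w = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    return w @ X

apple_fruit = contextual(["apple", "orange"])[0]  # pulled toward the fruit sense
apple_tech  = contextual(["apple", "iphone"])[0]  # pulled toward the tech sense
# static["apple"] is identical in both sentences; the contextual vectors differ.
```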

  • @rikiakbar4025
    @rikiakbar4025 5 months ago

    Thanks Luis, I've been following your content for a while. This video about the attention mechanism is very intuitive and easy to follow

  • @pranayroy
    @pranayroy 10 months ago +1

    Kudos to your efforts in clear explanation!

  • @abu-yousuf
    @abu-yousuf 1 year ago +1

    amazing explanation Luis. Can't thank you enough for your amazing work. You have a special gift to explain things. Thanks.

  • @PedroTrujilloV
    @PedroTrujilloV 2 months ago

    Thanks!

  • @mostinho7
    @mostinho7 1 year ago +1

    7:00 Even with word embeddings, words can be missing context and there's no way to tell, like the word "apple": are you talking about the company or the fruit?
    Attention matches each word of the input with every other word, in order to transform it, pulling it toward a different location in the embedding space based on the context. So when the sentence is "buy apple and orange", the word "orange" will cause the word "apple" to have an embedding (vector representation) that's closer to the fruit.
    8:00
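The "pulling" described above can be written as a tiny interpolation (made-up 2-D coordinates and a made-up attention weight, purely illustrative):

```python
import numpy as np

# Axis 0 ~ "fruit-ness", axis 1 ~ "tech-ness" (hypothetical coordinates).
apple  = np.array([0.5, 0.5])   # ambiguous between the fruit and the company
orange = np.array([0.9, 0.1])   # clearly a fruit

w = 0.4                          # illustrative attention weight from "orange"
apple_in_context = (1 - w) * apple + w * orange

print(apple_in_context)          # [0.66 0.34]: moved toward the fruit region
```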

  • @s.chandrasekhar8290
    @s.chandrasekhar8290 1 year ago

    Thanks!

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +1

      Thank you so much for your contribution!!! How kind of you!

  • @iliasp4275
    @iliasp4275 7 months ago +1

    Excellent video. Best explanation on the internet!

  • @kevon217
    @kevon217 1 year ago +1

    Wow, clearest example yet. Thanks for making this!

  • @karlbooklover
    @karlbooklover 1 year ago +2

    best explanation of embeddings I've seen, thank you!

  • @JorgeMartinez-xb2ks
    @JorgeMartinez-xb2ks 1 year ago

    The best video I've seen on the subject. Thank you very much for this great work.

  • @sari54754
    @sari54754 1 year ago +1

    The easiest-to-understand video on the subject I've seen.

  • @soumen_das
    @soumen_das 1 year ago +2

    Hey Luis, you are AMAZING! Your explanations are incredible.

  • @agbeliemmanuel6023
    @agbeliemmanuel6023 1 year ago +3

    Wooow thanks so much. You are a treasure to the world. Amazing teacher of our time.

  • @MikeTon
    @MikeTon 11 months ago

    This clarifies embedding matrices:
    - In particular the point that, just as a book isn't a RANDOM array of words, these matrices are NOT a RANDOM array of numbers
    - The visualization of the transforms and shearing really drives home the V, Q, K aspect of the attention matrix that I have been STRUGGLING to internalize
    Big, big thanks for putting together this explanation!
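For readers who want to see those three linear transformations concretely, here is a minimal single-head sketch in NumPy (random matrices standing in for learned weights; the shapes are illustrative, not taken from the video):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4

X = rng.normal(size=(n_tokens, d_model))      # token embeddings

# Three learned linear maps (random here, just to show shapes and data flow).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # transformed copies of the tokens
scores = Q @ K.T / np.sqrt(d_k)               # scaled dot-product similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
output = weights @ V                          # shape (n_tokens, d_k)
```

Multi-head attention repeats this with independent W_Q, W_K, W_V per head and concatenates the per-head outputs.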

  • @tanggenius3371
    @tanggenius3371 6 months ago

    Thanks, the explanation is so intuitive. I finally understood the idea of attention.

  • @yairbh
    @yairbh 5 months ago

    Great explanation with the linear transformation matrices. Thanks!

  • @RamiroMoyano
    @RamiroMoyano 1 year ago +1

    This is amazingly clear! Thanks for your work!

  • @dr.mikeybee
    @dr.mikeybee 1 year ago +2

    Nicely done! This gives a great explanation of the function and value of the projection matrices.

  • @hyyue7549
    @hyyue7549 1 year ago +3

    If I understand correctly, the transformer is basically an RNN model intercepted by a bunch of different attention layers. The attention layers redo the embeddings every time a new word comes in; the new embeddings are calculated based on the current context and the new word, then the embeddings are sent to the feed-forward layer and behave like the classic RNN model.

    • @lohithArcot
      @lohithArcot 4 months ago

      Can anyone confirm this?

  • @davutumut1469
    @davutumut1469 1 year ago +1

    amazing, love your channel. It's certainly underrated.

  • @mayyutyagi
    @mayyutyagi 6 months ago

    Amazing video... Thanks, sir, for this pictorial representation, explaining this complex topic in such an easy way.

  • @justthefactsplease
    @justthefactsplease 9 months ago +1

    What a great explanation on this topic! Great job!

  • @hkwong74531
    @hkwong74531 11 months ago

    I subscribed to your channel immediately after watching this video. It's the first video I've watched from your channel, but also the first that made me understand why embeddings need to be multi-headed. 👍🏻👍🏻👍🏻👍🏻

  • @eddydewaegeneer9514
    @eddydewaegeneer9514 8 months ago

    Great video and very intuitive explanation of the attention mechanism

  • @homakashefiamiri3749
    @homakashefiamiri3749 3 months ago

    It was the most useful video explaining attention mechanism. Thank you

  • @perpetuallearner8257
    @perpetuallearner8257 1 year ago +1

    You're my fav teacher. Thank you Luis 😊

  • @caryjason4171
    @caryjason4171 9 months ago

    This video helps to explain the concept in a simple way.

  • @satvikparamkusham7454
    @satvikparamkusham7454 1 year ago

    This is the most amazing video on "Attention is all you need"

  • @alijohnnaqvi6383
    @alijohnnaqvi6383 11 months ago +1

    What a great video man!!! Thanks for making such videos.

  • @唐伟祚-j4v
    @唐伟祚-j4v 9 months ago

    It's so great, I finally understand these QKVs; they bothered me for so long. Thank you so much!!!

  • @muhammetibrahimkaraman7471
    @muhammetibrahimkaraman7471 3 months ago

    I really enjoyed the way you described and demonstrated matrices as linear transformations. Thank you! Why? Because I like Linear Algebra 😄

  • @Omsip123
    @Omsip123 7 months ago +1

    Outstanding, thank you for this pearl of knowledge!

  • @cyberpunkdarren
    @cyberpunkdarren 10 months ago

    Very impressed with this channel and presenter

  • @kafaayari
    @kafaayari 1 year ago

    Well, the gravity example is how I finally understood this after a long time. You are a true legend.

  • @bananamaker4877
    @bananamaker4877 1 year ago +1

    Explained very well. Thank you so much.

  • @DeepakSharma-xg5nu
    @DeepakSharma-xg5nu 10 months ago

    I did not even realize this video is 21 minutes long. Great explanation.

  • @ThinkGrowIndia
    @ThinkGrowIndia 1 year ago +1

    Amazing! Loved it! Thanks a lot Serrano!

  • @dragolov
    @dragolov 1 year ago +1

    Deep respect, Luis Serrano! Thank you so much!

  • @BhuvanDwarasila-y8x
    @BhuvanDwarasila-y8x 3 months ago

    Thank you so much for the attention to the topic!

    • @SerranoAcademy
      @SerranoAcademy  3 months ago

      Thanks! Lol, I see what you did there! :D

  • @orcunkoraliseri9214
    @orcunkoraliseri9214 10 months ago

    I watched a lot about attention. You are the best. Thank you, thank you. I am also learning from you how to explain a subject 😊

  • @VenkataraoKunchangi-uy4tg
    @VenkataraoKunchangi-uy4tg 7 months ago

    Thanks for sharing. Your videos are helping me in my job. Thank you.

  • @Cdictator
    @Cdictator 6 months ago

    This is amazing explanation! Thank you so much 🎉

  • @LuisOtte-pk4wd
    @LuisOtte-pk4wd 11 months ago

    Luis Serrano, you have a gift for explaining! Thank you for sharing!

  • @ignacioruiz3732
    @ignacioruiz3732 10 months ago

    Outstanding video. Amazing to gain intuition.

  • @WhatsAI
    @WhatsAI 1 year ago +1

    Amazing explanation Luis! As always...

  • @erickdamasceno
    @erickdamasceno 1 year ago +2

    Great explanation. Thank you very much for sharing this.

  • @sathyanukala3409
    @sathyanukala3409 10 months ago

    Excellent explanation. Thank you very much.

  • @arshmaanali714
    @arshmaanali714 5 months ago

    Superb explanation❤ please make more videos like this

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 6 months ago

    Thanks!

    • @SerranoAcademy
      @SerranoAcademy  5 months ago

      @DiegoSilva-dv9uf Thank you so much for your kind contribution Diego!

  • @SulkyRain
    @SulkyRain 1 year ago

    Amazing explanation 🎉

  • @tvinay8758
    @tvinay8758 1 year ago

    This is a great explanation of the attention mechanism. I have enjoyed your Maths for Machine Learning on Coursera. Thank you for creating such wonderful videos

  • @debarttasharan
    @debarttasharan 1 year ago +1

    Incredible explanation. Thank you so much!!!

  • @vishnusharma_7
    @vishnusharma_7 1 year ago

    You are great at teaching Mr. Luis

  • @orcunkoraliseri9214
    @orcunkoraliseri9214 10 months ago

    Wooow. Such a good explanation of embeddings. Thanks 🎉

  • @jayanthAILab
    @jayanthAILab 9 months ago

    Wow wow wow! I enjoyed the video. Great teaching sir❤❤

  • @sukhpreetlotey1172
    @sukhpreetlotey1172 10 months ago

    First of all, thank you for making these great walkthroughs of the architecture. I would really like to support your efforts on this channel. Let me know how I can do that. Thanks

    • @SerranoAcademy
      @SerranoAcademy  9 months ago

      Thank you so much, I really appreciate that! Soon I'll be implementing subscriptions, so you can subscribe to the channel and contribute (also get some perks). Please stay tuned, I'll publish it here and also on social media. :)

  • @neelkamal3357
    @neelkamal3357 3 months ago +1

    I didn't get why we add a linear transformation. Earlier we already had embeddings in other planes, so why do a shear transformation? Please, someone answer

  • @HoussamBIADI
    @HoussamBIADI 6 months ago

    Thank you for this amazing explanation

  • @r.k.vignesh7832
    @r.k.vignesh7832 2 months ago

    0:55 I thought attention mechanisms had been around for a while before this paper, e.g. Bahdanau et al. (2014) and likely even earlier in some form, and that this paper really served as i) an illustration that attention was... well, all you needed, and ii) the introduction of the Transformer model architecture?

  • @赵赵宇哲
    @赵赵宇哲 1 year ago

    This video is really clear!

  • @jeffpatrick787
    @jeffpatrick787 1 year ago

    This was great - really well done!

  • @aaalexlit
    @aaalexlit 1 year ago

    That's an awesome explanation! Thanks!

  • @maysammansor
    @maysammansor 10 months ago

    you are a great teacher. Thank you

  • @bbarbny
    @bbarbny 7 months ago

    Amazing video, thank you very much for sharing!

  • @bengoshi4
    @bengoshi4 1 year ago

    Yeah!!!! Looking forward to the second one!! 👍🏻😎

  • @notprof
    @notprof 1 year ago

    Thank you so much for making these videos!

  • @tantzer6113
    @tantzer6113 1 year ago +1

    Paraphrase: we weigh each embedding by its score, and then add up all these weighted embeddings to obtain a really good embedding. Question to think about: why not just take the best embedding? Is it because averaging improves robustness to noise?

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +3

      That is a great question! Yes, one reason is robustness. Also, each embedding may capture different things: one could be good for a certain topic (say, fruits) but terrible at others (say, technology).
      Another reason is continuity. Let's say you have embedding A, which has the highest score. The moment embedding B gets a higher score, you would switch abruptly from A to B, which creates a jump discontinuity. If you take the average instead, you smoothly go from, say, 0.51*A + 0.49*B to 0.49*A + 0.51*B, which is very similar.
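The continuity point in this reply can be checked numerically (toy 2-D embeddings, with a scalar weight standing in for the softmax score):

```python
import numpy as np

A = np.array([0.0, 1.0])
B = np.array([1.0, 0.0])

def soft(w_a):
    """Weighted average: the output moves smoothly as the score shifts."""
    return w_a * A + (1 - w_a) * B

def hard(w_a):
    """'Take the best embedding': the output jumps when the winner flips."""
    return A if w_a > 0.5 else B

delta_soft = soft(0.51) - soft(0.49)   # tiny change: [-0.02, 0.02]
# hard(0.51) is A but hard(0.49) is B: a jump discontinuity at the tie.
```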

    • @tantzer6113
      @tantzer6113 1 year ago

      Thanks for the answer, and for the wonderful video.

    • @tantzer6113
      @tantzer6113 1 year ago

      Maybe the next video will clarify how the weighting is achieved. At first I thought the V matrix provides the weighting of the different embeddings, but now I am not sure.

    • @SerranoAcademy
      @SerranoAcademy  1 year ago

      @@tantzer6113 Yes! I thought the exact same thing, but then someone showed me that it doesn't; those weights are encoded inside the transformer. I'm seeing the V matrix as another embedding into which the transformation is made (and the K and Q are used to find the distances). But I'll clarify this more in the next video.

  • @bravulo
    @bravulo 1 year ago

    Thanks. I saw your "Math behind" video as well, but the third in the series is still missing.

    • @SerranoAcademy
      @SerranoAcademy  1 year ago +2

      Thanks! The third video is out now! ua-cam.com/video/qaWMOYf4ri8/v-deo.html