Transformers explained | The architecture behind LLMs

AI Coffee Break with Letitia

Published Dec 30, 2024
Science & Technology

COMMENTS • 113

@YuraCCC 11 months ago ⁺¹⁵
Thanks for the explanation. At 9:19: shouldn't the order of multiplication be the opposite here? E.g. x1 (vector) * Wq (matrix) = q1 (vector). Otherwise I don't understand how we get the 1x3 dimensionality at the end.
@AICoffeeBreak 11 months ago ⁺⁹
Oh, shoot, I messed up the order in the animations there. You are right. Sorry, pinning your comment.
@YuraCCC 11 months ago ⁺¹
@AICoffeeBreak No problem, thanks for clarifying that, and thanks again for the great video!
@scifaipy9301 6 months ago
The vectors should be column vectors.
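The shape argument in this thread can be checked in a few lines. This is a minimal sketch in plain Python, assuming a 4-dimensional input embedding `x1` and a 4x3 query matrix `W_q`; the numbers are made up for illustration and only the shapes matter. It shows both conventions mentioned above: row vector times matrix, and the transposed matrix times a column vector.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    assert len(A[0]) == len(B), "inner dimensions must match"
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]


# Hypothetical values: a 1x4 row vector and a 4x3 projection matrix.
x1 = [[1.0, 0.0, 2.0, 1.0]]
W_q = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Row-vector convention: (1x4) @ (4x3) -> (1x3)
q1 = matmul(x1, W_q)
shape = (len(q1), len(q1[0]))

# Column-vector convention: (3x4) @ (4x1) -> (3x1), same numbers transposed
q1_col = matmul(transpose(W_q), transpose(x1))

print(q1, shape)
print(q1_col)
```

Writing `matmul(W_q, x1)` with these shapes trips the inner-dimension assertion, which is the mismatch the first comment points out.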
@MachineLearningStreetTalk 11 months ago ⁺⁵
Epic as always 🤌
@AICoffeeBreak 11 months ago ⁺¹
Thanks, Tim!
@420_gunna 11 months ago ⁺⁶
Awesome video, thank you! I love the idea of you revisiting older topics -- either as a 201 or as a re-introduction. "Attention combines the representation of input vector's value vectors, weighted by the importance score (computed by the query and key vectors)."
@AICoffeeBreak 11 months ago ⁺³
Thanks for your appreciation!
@abhishek-tandon 11 months ago ⁺⁷
One of the best videos on transformers that I have ever watched. Views 📈
@AICoffeeBreak 11 months ago ⁺¹
Do you have examples of others you liked?
@dannown 11 months ago ⁺⁴
Really appreciate this video.
@AICoffeeBreak 11 months ago
So glad!
@tildarusso 11 months ago ⁺⁴
As far as I am aware, word embedding has changed from legacy static embeddings like Word2Vec/GloVe (like the famous queen = woman + king - man metaphor) to BPE & unigram. This change gave me quite a headache, as most papers do not mention any details of their "word embedding". Perhaps, Letitia, you can make a video to clarify this a bit for us.
@AICoffeeBreak 11 months ago ⁺¹
Great suggestion, thanks!
@DerPylz 11 months ago ⁺¹²
Wow, you've come a long way since your first transformer explained video!
@MuruganR-tg9yt 10 months ago ⁺³
Thank you. Nice explanation 😊
@AICoffeeBreak 10 months ago ⁺¹
Thank you for your visit!
@DaveJ6515 11 months ago ⁺⁹
You know how to explain things. This one is not easy: I can see the amount of work that went into this video, and it was a lot. I hope that your career takes you where you deserve.
@AICoffeeBreak 11 months ago ⁺¹
Thanks for watching and thanks for the kind words. All the best to you as well!
@zahrashah6567 8 months ago ⁺¹
What a wonderful explanation😍 Just discovered your channel and absolutely loving the explanations as well as visuals😘
@AICoffeeBreak 8 months ago ⁺²
Thank you! Welcome!
@xyphos915 11 months ago ⁺⁹
Wow, this explanation on the difference between RNNs and Transformers at the end is what I was missing!
I've always heard that Transformers are great because of parallelization but never really saw why until today, thank you! Great video!
@AICoffeeBreak 11 months ago ⁺²
Oh, this makes me happy!
@connor-shorten 11 months ago ⁺⁵
Awesome! Epic Visuals!
@AICoffeeBreak 11 months ago ⁺¹
Thanks, Connor!
@cosmic_reef_17 11 months ago ⁺⁵
Thank you very much for the very clear explanations and detailed analysis of the transformer architecture. You're truly the 3blue1brown of machine learning!
@AICoffeeBreak 11 months ago ⁺¹
@l.suurmeijer1382 11 months ago ⁺⁵
Absolute banger of a video. Wish I had seen this when I was learning about transformers in uni last year :-)
@AICoffeeBreak 11 months ago ⁺¹
Haha, glad I could help. Even if a bit late.
@partywen 6 months ago ⁺²
Super informative and helpful! Thanks a lot!
@AICoffeeBreak 6 months ago ⁺¹
Oh wow, thanks!
@jcneto25 11 months ago ⁺⁴
Best didactic explanation about Transformers so far. Thank you for sharing it.
@AICoffeeBreak 10 months ago ⁺¹
Wow, thanks! Glad it's helpful.
@Thomas-gk42 11 months ago ⁺⁶
Understood about 10%, but I like these videos and intuitively feel the usefulness.
@AICoffeeBreak 11 months ago ⁺³
@rahulrajpvr7d 11 months ago ⁺⁶
Tomorrow I have my thesis evaluation and I was thinking about watching that video again, but the YouTube algorithm suggested it to me without my searching for anything. Thank you, YouTube algo.
😅❤🔥
@AICoffeeBreak 11 months ago ⁺³
It read your mind.
@DatNgo-uk4ft 11 months ago ⁺⁴
Great Video!! Nice improvement over the original
@AICoffeeBreak 11 months ago ⁺²
Glad you think so!
@manuelafernandesblancorodr6366 11 months ago ⁺³
What a wonderful video! Thank you so much for sharing it!
@AICoffeeBreak 10 months ago ⁺¹
Thank you too for this wonderful comment!
@mccartym86 10 months ago ⁺³
I think I had at least 10 aha moments watching this, and I've watched many videos on these topics. Incredible job, thank you!
@AICoffeeBreak 10 months ago ⁺¹
Wow, thank you for this wonderful comment!
@darylallen2485 8 months ago ⁺³
Letitia, you're awesome and I look forward to learning more from you.
@mumcarpet109 11 months ago ⁺⁶
Your videos have helped visual learners like me so much, thank you.
@AICoffeeBreak 11 months ago ⁺¹
Happy to hear that!
@heejuneAhn 6 months ago ⁺²
BEST of BEST explanation: 1) visually, 2) intuitively, 3) by numerical examples. And your English is easier for non-native speakers to follow than a native speaker's.
@Clammer999 7 months ago ⁺²
Thanks so much for this video. I’ve gone through a number of videos on transformers and this is much easier to grasp and understand for a non-data scientist like myself.
@AICoffeeBreak 7 months ago
You're very welcome!
@davidespinosa1910 3 months ago
Time is quadratic in the sequence length, but memory is linear -- see the FlashAttention paper.
But the number of parameters is constant -- that's the magic!
Thanks for the excellent videos! 👍
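The distinction in this comment between what grows with the sequence and what does not can be made concrete with simple counting. This is a rough sketch, assuming a single attention layer with four square projection matrices (query, key, value, output) and no biases; the function names are hypothetical and the model width of 64 is arbitrary.

```python
def attention_param_count(d_model):
    """Weights in W_q, W_k, W_v and W_o, each d_model x d_model (biases ignored)."""
    return 4 * d_model * d_model


def attention_score_entries(seq_len):
    """Every token is scored against every token: an n x n grid of scores."""
    return seq_len * seq_len


for n in (8, 64, 512):
    # The parameter count stays fixed while the number of scores grows as n^2.
    print(n, attention_param_count(d_model=64), attention_score_entries(n))
```

The n x n scores always have to be computed, which is the quadratic time. Whether they also have to be stored all at once is an implementation choice; FlashAttention processes them in blocks so that the extra memory grows only linearly with n.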
@GarySuffield-w9p 11 months ago ⁺⁵
Really well done and easy to follow, thank you
@AICoffeeBreak 11 months ago ⁺²
Glad you enjoy it!
@muhammedaneesk.a4848 11 months ago ⁺⁴
Thanks for the explanation 😊
@AICoffeeBreak 11 months ago ⁺²
Thanks for watching!
@HarishAkula-df8gs 9 months ago ⁺²
Amazing explanation, thank you! Just discovered your channel and I really like how the difficult topics are demystified.
@AICoffeeBreak 9 months ago ⁺¹
Thanks a lot!
@xxlvulkann6743 9 months ago ⁺²
This is a very well-made explanation. I hadn't known that the feedforward layers only received one token at a time. Thanks for clearing that up for me! 😁
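The point about feedforward layers seeing one token at a time can be illustrated with a tiny example. This is a sketch only, assuming a two-layer ReLU network without biases and made-up weights `W1` and `W2`: the same weights are applied to each token vector independently, so changing one token changes only that token's output.

```python
def feed_forward(x, W1, W2):
    """Two-layer MLP with ReLU, applied to a single token vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col))) for col in zip(*W1)]
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*W2)]


# Hypothetical weights: 2 -> 3 -> 2 dimensions.
W1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tokens = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
outputs = [feed_forward(t, W1, W2) for t in tokens]

# Replace only the middle token and run the same network again.
changed = [tokens[0], [5.0, 5.0], tokens[2]]
outputs_changed = [feed_forward(t, W1, W2) for t in changed]

print(outputs)
print(outputs_changed)
```

Mixing information across tokens is left to the attention layers; the feedforward step has no way to look at neighbouring positions.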
@paprikar 11 months ago ⁺³
Here we go!
TY for the content
@supanutsookkho2749 6 months ago ⁺²
Great video and a good explanation. Thanks for your hard work on this amazing video!!
@AICoffeeBreak 5 months ago ⁺¹
Glad you liked it!
@uw10isplaya 6 months ago ⁺²
Had to go back and rewatch a section after I realized I'd been spacing out staring at the coffee bean's reactions.
@AICoffeeBreak 5 months ago ⁺¹
@l3nn13 11 months ago ⁺⁴
great video
@AICoffeeBreak 11 months ago ⁺¹
Thanks for the visit and for leaving the comment!
@ehudamitai 11 months ago ⁺³
At 11:14, the weighted sum is the sum of 3 vectors of 3 elements each, but the result is a vector of 4 elements, which, conveniently, is the same size as the input vector. Could there be a missing step there?
@AICoffeeBreak 11 months ago ⁺²
Yes, there is a missing back-transformation to 4 dimensions that I skipped. :) Well spotted!
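The skipped back-transformation can be sketched as one more matrix multiplication. The example below assumes, for illustration only, three 3-dimensional value vectors, one row of attention weights, and a hypothetical 3x4 output matrix `W_o`; none of these numbers come from the video.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# One query's attention weights over 3 tokens (they sum to 1).
attention_weights = [[0.5, 0.25, 0.25]]

# Value vectors of the 3 tokens, 3 dimensions each (made-up numbers).
values = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 4.0],
]

# Weighted sum of the value vectors: (1x3) @ (3x3) -> (1x3)
context = matmul(attention_weights, values)

# Hypothetical output projection back to the 4-dimensional input size.
W_o = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
output = matmul(context, W_o)  # (1x3) @ (3x4) -> (1x4)

print(context)
print(output)
```

Projecting back to the input size is what allows the result to be added to the original 4-dimensional vector through the residual connection.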
@gettingdatasciencedone 1 month ago ⁺¹
Great explanation -- loving your videos. The time codes for specific topics are really useful.
@AICoffeeBreak 22 days ago
Thank you!
@SamehSyedAjmal 11 months ago ⁺⁴
Thank you for the video! Maybe an explanation on the Mamba Architecture next?
@AICoffeeBreak 10 months ago ⁺³
The Mamba and SSM beans are roasting as we speak.
@tomoki-v6o 11 months ago ⁺³
Well explained, as you promised.
@AICoffeeBreak 11 months ago ⁺¹
@realbenjoyo 28 days ago ⁺¹
This was really great, never really understood query, key and values before.
@AICoffeeBreak 22 days ago
Thank you!
@phiphi3025 11 months ago ⁺³
Thanks, you helped me so much in explaining Transformers to my PhD advisors.
@AICoffeeBreak 11 months ago ⁺¹
This is really funny. In what field are you doing your PhD? 😅
@pfever 10 months ago ⁺²
Just discovered your channel and this is great! Thank you! :D
@AICoffeeBreak 10 months ago ⁺¹
Thank you! Hope to see you again soon in the comments.
@jonas4223 11 months ago ⁺⁴
Today, I had the problem that I needed to understand how Transformers work. I searched on YouTube and found your video 20 minutes after release. What perfect timing!
@AICoffeeBreak 11 months ago ⁺¹
What timing!
@volpir4672 11 months ago ⁺⁵
That's great. I'm a little stuck on the special mask token, but I'll keep digging. Good info: the video is a good explanation, and it allows for more experimentation instead of relying on open-source models whose components can look like a black box to noobs like me :)
@ArthasDKR 11 months ago ⁺³
Excellent explanation. Thank you!
@AICoffeeBreak 11 months ago ⁺¹
@bartlomiejkubica1781 11 months ago ⁺²
Thank you! Finally, I start to get it...
@M4ciekP 11 months ago ⁺⁵
How about a video explaining SSMs?
@AICoffeeBreak 11 months ago ⁺²
✍️
@AICoffeeBreak 10 months ago ⁺²
Psst: This will be the video coming up in a few days. It's in editing right now.
@M4ciekP 10 months ago
@AICoffeeBreak Yaay!
@Ben_D. 9 months ago ⁺²
...ok. After binging some of your vids, I now need to go make coffee. 😆
@AICoffeeBreak 9 months ago ⁺¹
Please do!
@ai-interview-questions 11 months ago ⁺³
Thank you, Letitia!
@AICoffeeBreak 10 months ago ⁺¹
Our pleasure!
@zbynekba 11 months ago ⁺³
❤ Letitia, thank you for great visualization and intuition. For inspiration: In the original paper, the decoder utilizes the output of the encoder by running a cross-attention process. Why does GPT not use an encoder? As you've mentioned, the encoder is typically used for classification, while the decoder is for text generation. They are never used in combination. Why is this the case?
Missing Intuition: Why does the cross-attention layer inside the decoder take the values from the ENCODER’s output to create the enhanced embeddings (as a weighted mix)? Intuitively, I would use the values from the DECODER.
@AICoffeeBreak 11 months ago ⁺³
Thanks for your thoughts! Encoders are sometimes used in combination with decoders, right? The most famous example is the T5 architecture.
@zbynekba 11 months ago ⁺²
Thanks for your prompt reply. Hence, understanding the concept and intuition behind feeding the encoder output into the decoder is essential. I found only this one video on encoder-decoder cross-attention:
ua-cam.com/video/Dqjq4Gxdhng/v-deo.htmlsi=gtLzNxAU0pUGyLvk
In it, Lennart emphasizes the observation that, based on the original equations, we have the enhanced embeddings calculated as a weighted sum of ENCODER values. Inside of a DECODER, I would rather expect to have the DECODER values pass through.
Letitia, I am sure, you will resolve this mystery. 🍀
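The data flow being discussed in this thread can be written down compactly. This is a minimal single-head sketch in plain Python, assuming identity projection matrices and made-up 2-dimensional states; `cross_attention` is a hypothetical helper, not code from the video or the paper. It shows which side each of the three inputs comes from: queries from the decoder, keys and values from the encoder.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    Q = matmul(decoder_states, W_q)    # queries come from the decoder
    K = matmul(encoder_outputs, W_k)   # keys come from the encoder
    V = matmul(encoder_outputs, W_v)   # values come from the encoder
    scale = math.sqrt(len(Q[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)            # (target positions) x (source positions)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, V)


# Assumption for brevity: identity projections and toy 2-dimensional states.
identity = [[1.0, 0.0], [0.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]               # 2 target positions
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions

weights, mixed = cross_attention(decoder_states, encoder_outputs, identity, identity, identity)
print(weights)
print(mixed)
```

Each decoder position asks a question (its query) and receives a weighted mix of source-side values, so this sub-layer is where source information enters the decoder. In the original architecture the decoder's own representation is not discarded: it is added back through the residual connection around the sub-layer.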
@TheAlexBell 2 months ago
Good explanation. Most videos on attention focus on how it's implemented, not on the design choices behind it. To my understanding, the goal was to mitigate the computational inefficiencies of RNNs and the spatial limitations of CNNs in order to achieve a universal representation of a sequence. I wanted to clarify one thing: you depicted multiple FFNNs similarly to how RNNs are usually rolled out. Is it just the same one FFNN that takes a single attention-encoded vector as input and predicts the next token from this ONE vector? By the way, what brand is that sweater? Loro Piana? :)
@Jeshhhhhh 4 months ago ⁺²
Oh my goddess in disguise, I thank you for saving me from depths of hell. Lots of love
@AICoffeeBreak 4 months ago
Glad to help. 😆
@kallamamran 11 months ago ⁺³
Phew 😳
@LEQN 9 months ago ⁺¹
Awesome video :) thanks!
@AICoffeeBreak 9 months ago ⁺¹
Thank you for watching and for your wonderful comment!
@heejuneAhn 5 months ago
Thanks for your video. I have a question on the inference process. For example, when I have an input prompt of 2 tokens = {t1, t2}, we will get the output {o1, o2, o3}. We will take only o3 and make the new input sequence {t1, t2, o3}. Then we will get another output {o'1, o'2, o'3, o'4}.
Here are my questions. When we use causal masking for attention, is o1 = o'1, o2 = o'2, and so on? Another question: even though the mask guarantees causal attention, the matrix calculation is still performed, which means the computation is spent anyway. How can we reduce the computational cost in this case?
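Both questions can be explored with a toy causal attention layer. The sketch below makes several simplifying assumptions: a single head, no learned projections (each token vector serves as its own query, key, and value), made-up 2-dimensional tokens, and one output vector per input position. It checks that appending a token leaves the earlier outputs unchanged, and that the newest output can be computed on its own from stored keys and values, which is the idea behind a key/value cache.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(d)]


def causal_self_attention(X):
    """Position i may only look at positions 0..i (the causal mask)."""
    return [attend(X[i], X[: i + 1], X[: i + 1]) for i in range(len(X))]


# Made-up token vectors.
t1, t2, t3 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

short = causal_self_attention([t1, t2])
longer = causal_self_attention([t1, t2, t3])

# With stored keys/values, only the newest position needs computing.
newest_only = attend(t3, [t1, t2, t3], [t1, t2, t3])

print(longer[:2] == short)       # earlier outputs are unchanged
print(newest_only == longer[2])  # the new row alone is enough
```

In this sketch the answer to the first question is yes: earlier outputs do not change. The second question is what caching addresses: instead of recomputing the full masked matrix at every step, an implementation can keep the keys and values of previous tokens and compute one new row per generated token.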
@LinkhManu 1 month ago
You’re the best 👏👏👏
@nmfhlbj 9 months ago
Hi! Can I ask how you got the dimension (d)? All I know is that dimension can be found in square matrices, and the dot product in the attention formula is Q•K^T. If we're using 1x3 matrices, we'll get a 1x1 matrix, or 1 dimension, so how do you get 3? Unless it's a 3x1 matrix beforehand, so we'll get a 3x3, or 3-dimensional, matrix.
Thank you!
@AICoffeeBreak 9 months ago ⁺¹
Hi, if you mean the mistake at 10:00, then the problem is that I have written matrix times vector when I should have written vector times matrix!
(or I could have used column vectors instead of row vectors). Is this what you mean?
@DaeOh 11 months ago ⁺⁴
Everything makes sense except multiple attention heads. Each layer has only one set of Q, K, V, O matrices. But 8 attention heads per layer? I want to understand that.
@AICoffeeBreak 11 months ago ⁺⁵
Think about it this way: In one layer, instead of having one head telling you how to pay attention at things, you have 8.
In other words, instead of having one person shout at you the things they want you to pay attention to, you have 8 people simultaneously shouting at you.
This is beneficial because it has an ensembling effect (the effect of a voting parliament. Think of Random Forests that are an ensemble of Decision Trees).
I do not know if this helps, but I thought I'd give explaining it another shot.
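One common way implementations reconcile "one set of matrices per layer" with "8 heads" is to slice the wide projected vectors into equal chunks, one per head. This is a sketch under that assumption, with a hypothetical model width of 16 and made-up helper names `split_heads` and `merge_heads`; it shows only the bookkeeping, not the attention computation inside each head.

```python
def split_heads(vector, num_heads):
    """Cut one projected vector of width d_model into num_heads equal slices."""
    head_dim = len(vector) // num_heads
    assert head_dim * num_heads == len(vector), "width must divide evenly"
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head results back into one vector of width d_model."""
    return [x for head in heads for x in head]


# A stand-in for one token's projected query (d_model = 16, hypothetical).
q = list(range(16))

heads = split_heads(q, num_heads=8)  # 8 heads, each of width 2
restored = merge_heads(heads)

print(heads)
print(restored == q)
```

Each slice of the query is compared only with the matching slice of the keys, so every head produces its own attention pattern. The per-head outputs are then concatenated and passed through the single output matrix.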
@benjamindilorenzo 10 months ago
What a great video.
It could still expand more and really sum up every sub-part, connecting it to a clear visualization or clear step of what happens with the information at each time step and how its "transformation" progresses over time.
So I think you could redo this video and really make it monkey-proof for folks like me.
But beware: if you look, for example, at the StatQuest version, it's too slow and too repetitive, and it also does not really capture what goes on inside the Transformer once all steps are stacked together.
Great work!
@josephvanname3377 11 months ago
I want to train a transformer that eats a row of matrices instead of just a row of vectors.
@davide0965 11 days ago
Terrible
@DerPylz 11 days ago
If you don't like her videos, why do you keep coming back to them just to comment that you didn't like it? Just watch something else.
