To learn more about Lightning: lightning.ai/
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
@statquest No problem Sir.
Thank you.
“Statquest is all you need” - I really needed this video for my NLP course but glad it’s out now. I got an A+ for the course, your precious videos helped a lot!
BAM! :)
Somehow Josh always figures out what video we are going to need!
Exactly, I was gonna say the same 😃
BAM! :)
Same here 😂
The level of explainability from this video is top-notch. I always watch your video first to grasp the concept then do the implementation on my own. Thank you so much for this work !
Glad it was helpful!
This channel is pure gold. I'm a machine learning and deep learning student.
Thanks!
Can’t thank enough for this guy helped me get my master degree in AI back in 2022, now I’m working as a data scientist and still kept going back to your videos.
BAM!
The amount of effort for some of these animations, especially in these videos on Attention and Transformers, is insane. Thank you!
Glad you like them!
Dang this came out just 2 days after my neural networks final. I’m still so happy to see this video in feed. You do such great work Josh! Please keep it up for all the computer scientists and statisticians that love your videos and eagerly await each new post
Thank you very much! :)
@@statquest it came out 3 days before my Deep Learning and NNs final. BAM!!!
@@Neiltxu Awesome! I hope it helped!
@@statquest for sure! Your videos always help! btw, do you ship to Spain? I like the hoodies in your shop
@@Neiltxu I believe the hoodies ship to Spain. Thank you for supporting StatQuest! :)
Hi Mr. Josh, just wanna say that there is literally no one that makes it so easy for me to understand such complicated concepts. Thank you! Once I get a job I will make sure to give you guru dakshina! (meaning, an offering from students to their teachers)
Thank you very much! I'm glad my videos are helpful! :)
This is the best explanation ever, not only in this video, but in the entire course...... Thanks a lot...
Glad you are enjoying the whole course.
I was literally trying to understand attention a couple of days ago and Mr.BAM posts a video about it. Thanks 😊
Same :D absolutely insane...
BAM! :)
Josh - I've read the original papers and countless online explanations, and this stuff never makes sense to me. You are the one and only reason as to why I understand machine learning. I wouldn't be able to make any progress on my PhD if it wasn't for your videos.
Thanks! I'm glad my videos are helpful! :)
Great work, Josh! Listening to my deep learning lectures and reading papers become way easier after watching your videos, because you explain the big picture and the context so well!! Eagerly waiting for the transformers video!
Coming soon! :)
1 million subscribers INCOMING!!!
Also huge thanks to Josh for providing such insightful videos. These videos really make everything easy to understand, I was trying to understand Attention and BAM!! found this gem.
Thank you very much!!! BAM! :)
I was just reading the original attention paper and then BAM! You uploaded the video. Thank you for creating the best content on AI on UA-cam!
Thank you very much! :)
This is awesome mate, can't wait for the next installment! Your tutorials are indispensable!
Thank you!
@@statquest BAM!
for this video attention is all you need
Ha!
The best explanation of Attention that I have come across so far ...
Thanks a bunch❤
Thank you very much! :)
Thanks for the wholesome contents! Looking for Statquest video on the Transformer.
Wow!!! Thank you so much for supporting StatQuest!!! I'm hoping the StatQuest on Transformers will be out by the end of the month.
❤
The end is a classic cliffhanger for the series. You talk about how we don't need the LSTMs and I wait for an entire summer for transformers. Good job! :)
Ha! The good news is that you don't have to wait! You can binge! Here's the link to the transformers video: ua-cam.com/video/zxQyTK8quyY/v-deo.html
@@statquest Yeah! I already watched when you released it. I commented on how this deep learning playlist is becoming a series! :)
@@usser-505 bam!
I just wanna let you know that this series is absolutely amazing. So far, as you can see, I've made it to the 89th video, guess that's something. Now it's getting serious tho. Again, love what you're doing here man!!! Thanks!!
Thank you so much!
@@statquest Personally, since I'm a medical student, I really can't explain how valuable it is to me that you used so many medical examples in the videos. The moment you said in one of the first videos that you are a geneticist I was sold on this series, it's one of my favorite subjects at uni, crazy interesting!
@@benmelis4117 BAM! :)
The BEST explanation of Attention models!! Kudos & Thanks 😊
Thank you very much!
Hello Statquest, I would like to say Thank You for the amazing job, this content helped me understand a lot about how Attention works, especially because visual things help me understand better, and the way you join the visual explanation with the verbal one while keeping it interesting is on another level, Amazing work
Thank you!
You have a talent for explaining these things in a straightforward way. Love your videos. You have no video about Transformers yet, right?
The transformers video is currently available to channel members and patreon supporters.
This is called Luong attention. In its previous version, a simple neural net, trained along with the rest of the RNN, was used to get the similarity scores instead of the dot product; this older version was called Bahdanau attention.
Thank you for the amazing video, I had to watch it twice to make sense of it but it is amazingly done. If I can make a request/suggestion, showing mathematical equations sometimes helps make sense of things. So if you can include them in future videos, that would be great.
I'll keep that in mind.
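A minimal NumPy sketch contrasting the two scoring functions described in the comment above (the hidden size, the random weights, and the variable names are made-up assumptions for illustration, not values from the video):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # hidden size (assumed)
h_enc = rng.normal(size=d)   # one encoder hidden state
h_dec = rng.normal(size=d)   # current decoder hidden state

# Luong (dot-product) attention: the similarity score is just a dot product.
luong_score = h_enc @ h_dec

# Bahdanau (additive) attention: a small neural net, trained with the rest of
# the model, scores the pair of hidden states.
W_enc = rng.normal(size=(d, d))
W_dec = rng.normal(size=(d, d))
v = rng.normal(size=d)
bahdanau_score = v @ np.tanh(W_enc @ h_enc + W_dec @ h_dec)

print(luong_score, bahdanau_score)
```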
this was the most beautiful explanation that i ever had in my entire life, thank you!
Wow, thank you!
3:14 That and the vanishing gradient problem is a key factor. NNs update themselves with gradient descent, basically derivatives, and the deeper the LSTM, the more derivative factors get multiplied together by the chain rule. Since each factor is often much smaller than 1, the original loss gradient shrinks astronomically at every step, and beyond a dozen or so LSTM cells the gradient might effectively become 0, which results in the earlier LSTMs literally not learning. So not only do LSTMs not remember stuff from words far away, they can't learn how to deal with words far away either, a double whammy :(
bam! :)
@@statquest It's a double bam but it is directed at our faces and our NN, not at the problem we are trying to solve, which is really bad :(
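A toy illustration of the vanishing-gradient point made above; the fixed 0.25 factor is an arbitrary stand-in for a derivative smaller than 1 at each unrolled step:

```python
# Multiplying many small chain-rule factors shrinks the gradient toward zero.
grad = 1.0
for step in range(30):
    grad *= 0.25      # one more derivative factor per unrolled LSTM step
print(grad)           # ~8.7e-19 -- effectively zero for the earliest steps
```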
The way you explain complex subjects in an easy-to-understand format is amazing! Do you have an idea of when you will release a video about transformers? Thank you Josh!
I'm shooting for the end of the month.
Hi Josh@@statquest , any update on the following? Would definitely need it for my final tomorrow :))
@@JeremyHalfon I'm finishing my first draft today. Hope to edit it this weekend and record next week.
You are amazing! The best explanation I've ever found on UA-cam.
Wow, thanks!
I was stunned when you started the video with a catchy jingle, man, cheers :D
:)
The music sung before the video is contagious ❤
:)
Really looking forward to your explanation of Transformers!!!
Thanks!
I was literally just thinking I'd love an explanation of attention by SQ..!!! Thanks for all your work
bam!
I feel like I am watching a cartoon as a kid. :)
bam!
This is the first time I've seen a Turk on this platform, are you a computer engineering student?
Ah excellent this is exactly what I was looking for!
Thank you!
@@statquest Can't wait for the next episode on Transformers!
Thanks for the amazing explanation. TRIPLE BAM!!!
:)
can't wait for the video about Transformers!
Me too!
Josh! Again, to get some attention with a cup of coffee, Double BAM!!
Thanks!
fun fact - if your vectors are scaled/mean-centered, cosine similarity is geometrically equivalent to the pearson correlation, and the dotproduct is the same as the covariance (un-scaled correlation).
nice.
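A quick numerical check of the fun fact above (made-up vectors): after mean-centering, cosine similarity matches the Pearson correlation, and the dot product divided by n-1 matches the sample covariance.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 3.0])
y = np.array([2.0, 1.0, 5.0, 4.0])
xc, yc = x - x.mean(), y - y.mean()   # mean-centered copies

cosine = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
pearson = np.corrcoef(x, y)[0, 1]
print(cosine, pearson)                        # the two values agree

dot_over_n1 = (xc @ yc) / (len(x) - 1)
print(dot_over_n1, np.cov(x, y)[0, 1])        # and so do these
```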
I am currently taking the AI cert program from MIT - I thank you for your channel
Thanks and good luck!
Hi Josh, I just bought your books. It's amazing the way that you explain complex things; reading the papers after watching your videos is easier.
NOTE: waiting for the video on transformers
Glad you like them! I hope the video on Transformers is out soon.
Excellent josh.... So finally MEGA Bammm is approaching.....
Hope u r doing good...
Yes! Thank you! I hope you are doing well too! :)
Hey Josh! Firstly, Thank you so much for this amazing content!! I can always count on your videos for a better explanation!
I have one quick clarification to make. Before the fully connected layer, the first two numbers we get are from [scaled(input1-cell1) + scaled(input2-cell1)] and [scaled(input1-cell2) + scaled(input2-cell2)], right?
And the other two numbers are from the outputs of the decoder, right?
Yes.
@@statquest Thank you for the clarification!
Hi StatQuest, I've been a long-time fan, your videos have helped me TREMENDOUSLY. For this video, however, I felt that if we could get a larger picture of how attention works first (how different words can have different weights, i.e., attending to them differently) and then go through a run with actual values, it'd be great! :) I also felt that the arrows and diagrams got a bit confusing in this one. Again, this is only constructive criticism and maybe it works for others and just not for me (this video I mean). Nonetheless, thank you so much for all the time and effort you put into making your videos. You're helping millions of people out there clear their degrees and achieve life goals
Thanks for the feedback! I'm always trying to improve how I make videos. Anyway, I work through the concepts more in my videos on transformers: ua-cam.com/video/zxQyTK8quyY/v-deo.html and if the diagrams are hard to follow, I also show how it works using matrix math: ua-cam.com/video/KphmOJnLAdI/v-deo.html
can't wait for the next StatQuest
:)
@@statquest I'm currently trying to fine-tune RoBERTa, so I'm really excited about the following video; hope the following videos will also talk about BERT and fine-tuning BERT
@@thanhtrungnguyen8387 I'll keep that in mind.
Been wanting this video for so long, gonna watch it soon!
bam!
When I see a new vid from Josh, I know today is a good day! BAM!
BAM! :)
Hey Josh your explanation is easy to understand. Thanks
Glad it was helpful!
I have been waiting for this for a long time
Transformers comes out on monday...
Godsent! Just what I needed! Thanks Josh.
bam!
Much awaited one .... Awesome as always ..
Thank you!
I'm excited for the video about transformers. Thank you Josh, your videos are extremely helpful
Coming soon!
Had been waiting for this for months.
The wait is over! :)
Goal: compute the similarity with EOS, the encoder's last token, and use it to generate the decoder's first token.
11:52 For one token: unroll a stage for every token, including the other tokens / for each token's stage, compute how similar it is to EOS with a dot product. => This gives a score for each token.
12:31 Running those scores through softmax gives values between 0 and 1. The more similar tokens are used more when generating the decoder's first token.
13:48 In the decoder, compute a softmax again to generate the decoder's first token.
The important point is that at 11:52 the calculation was done 'per token'.
- Originally, the whole stack for all tokens was sent to the decoder and used to generate the decoder's first token,
- whereas attention computes a dot product over the whole stack for each token and uses that to generate the decoder's first token.
bam
Hi Josh! No doubt, you teach in the best way. I have a request: I have been enrolled in a PhD and am going to start my work on Graphs. Can you please make a video about Graph Neural Networks and their variants? Thanks.
I'll keep that in mind.
Since you asked for video suggestions in another video: A video about the EM and Mean Shift algorithm would be great!
I'll keep that in mind.
weeeeee,
video for tonite,
thanks a lot
:)
Was eagerly waiting for this video
Bam! :)
thank you so much for making these great materials
Thanks!
Many thanks for your great video!
I have a question. You said that we calculate the similarity score between 'go' and EOS (11:30). But I think the vector (0.01,-0.10) is the context vector for "let's go" instead of "go" since the input includes the output for 'Let's' as well as the embedding vector for 'go'. It seems that the similarity score between 'go' and EOS is actually the similarity score between "let's go" and EOS. Please make it clear!
You can talk about it either way. Yes, it is the context vector for "Let's go", but it's also the encoding, given that we have already encoded "Let's", of the word "go".
Amazing video Josh! Waiting for the transformer video. Hopefully it'll come out soon. Thanks for everything!
Thanks! I'm working on it! :)
Your explanation is AMAZING AS ALWAYS!!
I have 1 doubt. Do we do the attention calculation only on the final layer? For example, if there are 2 layers in encoder and 2 layers in decoder, we use only the outputs from 2nd layer of encoder and 2nd layer of decoder for attention estimation, right?
I believe that is correct, but, to be honest, I don't think there is a hard rule.
I am always amazed by your tutorials! Thanks. And when can we expect the transformer tutorial to be uploaded?
Tonight!
Thank you for the excellent teaching, Josh. Looking forward to the Transformer tutorial. :)
Coming soon!
Thanks for this. The way you step through the logic is always very helpful
Thanks!
super clutch my final is on thursday thanks a lot!
Good luck!
I am still learning this, so I hope the next video comes out soon
I'm working on it as fast as I can.
Before, I was dumb, "guitar"
But now, people say I'm smart "guitar"
What is changed ? "guitar"
Now I watch.....
StatQueeeeeest ! "guitar guitar"
bam!
Thanks Professor Josh for such a great tutorial ! It was very informative !
My pleasure!
Currently learning about artificial neural networks😁
bam! :)
You're amazing Josh, thank you so much for all this content
Glad you enjoy it!
Can't wait for the transformer video!
I'm making great progress on it.
Please add to the neural network playlist! Or don't, it's your video; I just want to be able to find it when I'm looking for it to study for class.
I'll add it to the playlist, but the best place to find my stuff is here: statquest.org/video-index/
You are on Fire! Thank you so much
Thank you! :)
First of all, thanks a lot Josh! You made it way too understandable for us and I would be forever grateful to you for this!! Have a nice time! And can you please upload videos on Bidirectional LSTM and BERT?
I'll keep those topics in mind.
Best tutorial on YouTube
Thank you!
Hey! Great video, this is really helping me with neural networks at the university, do we have a date for when the transformer video comes out?
Soon....
Another awesome video! Josh, will you plan to talk about BERT? Thank you!
I'll keep that in mind.
Phew! Lots of things in this model, my brain feels a bit overloaded, haha
But thanks! Might have to rewatch this
You can do it!
Superb Videos. One question, is the fully connected layer just simply the softmax layer, there is no hidden layer with weights (meaning no weights are learned)?
No, there are weights along the connections between the input and output of the fully connected layer, and those outputs are then pumped into the softmax. I apologize for not illustrating the weights in this video. However, I included them in my video on transformers, and it's the same here. Here's the link to the transformers video: ua-cam.com/video/zxQyTK8quyY/v-deo.html
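A minimal sketch of the point in the reply above: the fully connected layer has its own learned weights (and biases), and the softmax is applied to its outputs. The input values are the ones discussed in this video; the weight values and the three-word output vocabulary are made-up assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
fc_input = np.array([-0.3, 0.3, 0.9, 0.4])   # attention values + decoder LSTM outputs
W = rng.normal(size=(3, 4))                  # learned weights (3 output tokens assumed)
b = np.zeros(3)                              # learned biases

logits = W @ fc_input + b
print(softmax(logits))                       # probabilities over the output vocabulary
```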
wow, i didn't think i would see this kind of stuff on this channel.
:)
Another BAM!
Thanks!
Do you have any courses with start-to-finish projects for people who are only just getting interested in machine learning?
Your explanations of the mathematical concepts have been great and I'd be more than happy to pay for a course that implements some of these concepts in real world examples
I don't have a course, but hope to have one one day. In the meantime, here's a list of all of my videos somewhat organized: statquest.org/video-index/ and I do have a book called The StatQuest Illustrated Guide to Machine Learning: statquest.org/statquest-store/
Hi Josh, thanks again for the awesomest video ever made on Attention models. The video is so wonderfully made that it made such an involved concept crystal clear. However, I have one small doubt. Up until time step 14:37 you explained attention with a single layer of LSTMs. But what if we have two layers in the Encoder and Decoder, as we had in the previous Seq2Seq Encoder-Decoder video? In that case, how is the attention going to get calculated?
My guess is that we will calculate the similarity score between the LSTM output of the second layer for each token and the LSTM output of the Decoder, and feed the final similarity score to the Fully Connected Layer along with the output of the hidden cells of the second layer's LSTMs.
Or will we calculate the similarity score between the LSTM output of each layer in the Encoder and each layer in the Decoder, and pass that input to the FC layer along with the output of the second layer in the Decoder, since that is the final output from the Decoder?
Thanks a lot again for being our saviour and your presence makes this the best time to learn new things.
Thank you! I'm pretty sure we would calculate the similarities between each layer in the encoder and each layer in the decoder and pass them to a fully connected layer.
Great, that's really what I was looking for, thanks mr Starmer for the explanation ❤
bam! :)
To really create a translator model, we would have to do a lot of work on the linguistics side, since there are differences in word order, verb conjugation, idioms, etc. Going from one language to another is a big structural challenge for coders.
That's the way they used to do it - by using linguistics. But very few people do it that way anymore. Now pretty much all translation is done with transformers (which are just encoder-decoder networks with attention, but not the LSTMs). Improvements in translation quality are gained simply by adding more layers of attention and using larger training datasets. For more details, see: en.wikipedia.org/wiki/Natural_language_processing
Could you do a video about Bert? Architectures like these can be very helpful on NLP and I think a lot of folks will benefit from that :)
I've got a video on transformers coming out soon.
Hi, great video. At 13:49, can you please explain how you get -0.3 and 0.3 for the input to the fully connected layer? Thank you
The outputs from the softmax function are multiplied with the short-term memories coming out of the encoder's LSTM units. We then add those products together to get -0.3 and 0.3.
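Plugging in approximate numbers from the video (as quoted in a comment further down) shows how the reply above produces -0.3 and 0.3; treat the exact values as assumptions subject to rounding.

```python
import numpy as np

weights = np.array([0.4, 0.6])                 # softmax of the similarity scores
encoder_outputs = np.array([[-0.76, 0.75],     # short-term memories for "Let's"
                            [ 0.01, -0.01]])   # short-term memories for "go"

attention_values = weights @ encoder_outputs   # scale each output, then add them up
print(attention_values)                        # roughly [-0.3, 0.3]
```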
Thanks! This was a great video!
Thank you very much! :)
Hello, thank you for the video, but I am confused because some terms introduced in the original 'Attention is All You Need' paper were not mentioned in the video, for example keys, values, and queries. Furthermore, in the paper, the authors don't talk about cosine similarity or applying LSTMs. Can you please clarify this case a little better?
The "Attention is all you need" manuscript did not introduce the concept of attention. That does done years earlier, and that is what this video describes. If you'd like to understand the "Attention is all you need" concept of transformers, check out my video on transformers here: ua-cam.com/video/zxQyTK8quyY/v-deo.html
Hi Josh. Amazing content as always, but this time I couldn't understand the explanation. Usually, I follow every step taking notes, but when it came to the attention and how it behaves I could not make sense of it even after watching multiple times and consulting external material.
I wonder if perhaps it is in the cards to make a revision of this video, with a more lengthy sentence to encode and decode. I am trying to get more material from other sources to better understand this.
All the best.
Can you give me specifics about the details you find confusing?
@@statquest Hi Josh! So I did more research and implemented a couple of models from scratch: vanilla RNN, LSTM, and finally Seq2Seq with attention.
What was confusing to me, without having implemented anything, were some concepts involving the models I mentioned above, such as:
1 - For models that use context vector, what is its size?
2 - Is the context vector passed at every encoding step of the encoder?
3 - In the attention models, do we pass the hidden state from the encoder to the decoder after every encoding step?
So my answers to those are (and correct me if I am wrong):
1 - The context vector is the size of the hidden layer
2 - As far as I know and in the models I worked with, the context vector is only passed once the encoder part is done.
3 - As far as I know and in the models I worked with (again), the collection of hidden states is passed as a bunch after the encoding is done.
PART 2: In addition to that, I was very confused with the model being explained here, but after my Seq2Seq with attention, I think I would summarize the current situation as follows:
We encode the sentence: [Let's, go, <EOS>]
We start the decoder with: [<EOS>]
Then, using the attention mechanism, we calculate the similarity between the token word already placed in the decoder: [<EOS>] - and all the words from the encoder.
Since we are looking for the next word in the decoder sequence, "Let's" will have a lower attention score using softmax, while "go" will be higher. Thus, "go" is selected. We pass this through a linear neural network, a softmax function again, and we land with the pick for the word "vamos".
We repeat the process again, looking for the next word in the decoder sequence, having: [<EOS>, "vamos"] and the attention from the decoder, then we land on <EOS>, and the final sequence is [<EOS>, "vamos", <EOS>].
@@paulotcj That's great! One minor detail is that the hidden state and the cell state make up the context vector. I've got a pytorch tutorial on this topic coming out with my new book in January.
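A minimal NumPy sketch of one decoder step with attention, following the summary in the comment above; the sizes, random values, and variable names are assumptions (a real model would use trained LSTM weights):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, vocab = 4, 3
encoder_states = rng.normal(size=(3, d))   # hidden states for: Let's, go, <EOS>
decoder_state = rng.normal(size=d)         # decoder state after feeding <EOS>

scores = encoder_states @ decoder_state    # dot-product similarity with each encoder state
alphas = softmax(scores)                   # attention weights
context = alphas @ encoder_states          # weighted sum of the encoder states

W_out = rng.normal(size=(vocab, 2 * d))    # fully connected layer over [context, decoder state]
logits = W_out @ np.concatenate([context, decoder_state])
next_token = int(np.argmax(softmax(logits)))   # e.g. the index of "vamos"
print(alphas, next_token)
```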
quadruple BAM !
Thanks!
Hello, I have a doubt. Is the initialization of the cell state and hidden state of the decoder a context vector that is the representation (generated by the encoder) of the entire sentence (the input)? And what about each hidden state (from the encoder) used in the decoder? Are they stored somehow? Thanks!!!
1) Yes, the context vector is a representation of the entire input.
2) The hidden states in the encoder are stored for attention.
@@statquest Thanks!!
I have one fundamental question related to how the attention model learns: basically, a higher attention score is given to those pairs of words which have a higher softmax(Q·K) similarity score. Now the question is how the relationship in the sentence "The cat didn't climb the tree as it was too tall" is calculated, and how the model knows that in this case "it" refers to the tree and not the "cat". Is it the large amount of data that the model reads that helps it distinguish the difference?
Yes. The more data you have, the better attention is going to work.
thank you sir for your brilliant work!
Thank you!
Attention is all you need...
bam!
Hi @statquest / @Josh ... This is an amazing video and I have been going through your content. All of it is some of the best explanation of AI that I have seen to date. In this video, towards the end where we are setting the input values of the fully connected layer, I am not able to place the values besides one of the attention values. Please confirm below if I am right:
Value from Encoder Layer:
let's : -0.76(1st LSTM) | 0.75(2nd LSTM)
go: 0.01(1st LSTM) | -0.01(2nd LSTM)
Value from Decoder Layer:
EOS: 0.91(1st LSTM) | 0.38(2nd LSTM)
Similarity Scores:
Lets and EOS : (0.91 X -0.76) + (0.38 X 0.75) = -0.6916 + 0.285 = -0.4066 ~ -0.41
go and EOS: (0.91 X 0.01) + (0.38 X -0.01) = 0.0091 + -0.0038 = 0.0053 ~ 0.01
After Softmax
Lets and EOS: 0.4
go and EOS: 0.6
Attention Values for the 1st LSTM, which is unrolled twice (for Let's and go):
-0.76*0.4 + 0.01*0.6 = -0.304 + 0.006 = -0.298 ~ -0.3
0.75*0.4 + -0.01*0.6 = 0.3 - 0.006 = 0.294 ~ 0.3
Thus we get the following input values for the fully connected layer:
1. Value from 1st LSTM Layer(Decoder) -> EOS: 0.91
2. Attention Value for 1st LSTM Layer(Encode) wrt EOS -> -0.3
I suppose the following two values are what we get from the 2nd LSTM layer, which has different initial values for the initial Short Term and Long Term memories:
3. Value from 2nd LSTM Layer(Decoder) -> EOS: 0.4
Let me know if my understanding is correct Josh.
What time point, minutes and seconds, are you asking about?
13:52 @@statquest
@@sunnywell264 The values are pretty close and probably slightly off due to rounding. Is that what you're worried about or is there something else?
Yes... I was worried about the delta in the values. I hope that my calculations above are correct and I am not at fault there.
@@sunnywell264 It's possible that, internally, my math is not rounding at each stage, so I'd be willing to bet that your math is just fine.
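For anyone following along, the arithmetic in this thread can be checked directly (numbers copied from the comment above; the small differences come from rounding):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

eos_dec = np.array([0.91, 0.38])      # decoder LSTM outputs for <EOS>
lets_enc = np.array([-0.76, 0.75])    # encoder LSTM outputs for "Let's"
go_enc = np.array([0.01, -0.01])      # encoder LSTM outputs for "go"

scores = np.array([lets_enc @ eos_dec, go_enc @ eos_dec])
print(scores)           # ~[-0.41, 0.01]
print(softmax(scores))  # ~[0.4, 0.6]
```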
Sorry, I could not quite understand:
1. Why can the decoding outputs (0.9, 0.4) be plugged in alongside the attention values (-0.3, 0.3)? What if their total length is not four? For example, if I have 3 decoding output values and 3 attention values, the total length of the FC layer input is six, which is unequal to the sequence length of 4.
2. What does "Do some math" mean? How did (-0.3, 0.3, 0.9, 0.4) become (-0.7, 4.7, -2, -2), and why does the maximum, 0.9, correspond to -2?
What time points, minutes and seconds, are you referring to?
great video thanks
Thanks!
Fantastic video, indeed! Is the attention described in the video the same as in the attention paper? I didn't see the mention of QKV in the video and would like to know whether it was omitted to simplify or by mistake.
Are you asking about the QKV notation that appears in the "Attention is all you need" paper? That manuscript arxiv.org/abs/1706.03762 , which came out in 2017, didn't introduce the concept of attention for neural networks. Instead it introduces a more advanced topic - Transformers. The original "how to add attention to neural networks" manuscript arxiv.org/pdf/1409.0473.pdf came out in 2015 and did not use the QKV notation that appeared later in the transformer manuscript. Anyway, my video follows the original, 2015, manuscript. However, I'm working on a video that covers the 2017 manuscript right now. And I've got a long section talking all about the QKV stuff in it.
That said, in this video, you can think of the output from each LSTM in the decoder as a "Query", and the outputs from each LSTM in the Encoder as the "Keys" and "Values". The "Keys" are used, in conjunction with each "Query" to calculate the Similarity Scores and the "Values" are then scaled by those scores to create the attention values.
@@statquest Thanks for the reply, Josh. Yes, I was referring to the 2017 paper. I look forward to your video covering it.
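A small sketch of the Query/Key/Value reading described in the reply above: the decoder LSTM output acts as the query, and the encoder LSTM outputs act as both the keys and the values (numbers reused from earlier in the thread; the exact values are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

query = np.array([0.91, 0.38])        # decoder output for <EOS> (the "Query")
keys = np.array([[-0.76, 0.75],       # encoder output for "Let's"
                 [ 0.01, -0.01]])     # encoder output for "go"
values = keys                         # here the "Keys" and "Values" coincide

weights = softmax(keys @ query)       # similarity scores -> attention weights
attention = weights @ values          # attention values passed on to the FC layer
print(weights, attention)
```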