To learn more about Lightning: lightning.ai/
Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
One of the best channels on YouTube! Wanted to provide some constructive criticism: either I am blind or you have forgotten to link the original paper you show in the video in the video description.
@@graedy2 Here it is: arxiv.org/abs/1409.3215
@@statquest Are the sequence-to-sequence model and the encoder-decoder architecture the same concept, or are they different from each other?
Can you resolve this question asap? I have my exams tomorrow.....😶🌫️
@@atharv1818_ Yes. That question should be answered in the first few minutes of this video.
This channel is like the Khan Academy of neural networks, machine learning, and statistics. Truly remarkable explanations
Thank you!
It's way better :) Khan Academy does not have such cool songs =:)
I literally searched everywhere and finally came across your channel. Seems like gradient descent worked fine.
:)
This channel is gold. I remember how, for my first coding job, where I had no programming knowledge (lol) but had no choice but to take it anyway, I quickly had to learn PHP and MySQL. To get myself started, I searched for the simplest PHP coding books and then got myself two books from the PHP & MySQL for kids series, even though I was already in my mid twenties. Long story short, I quickly learned the basics and coded for a living. Complex topics don't have to be complex; in fact they are always built on building blocks of simple concepts and can be explained and taught as such, IMHO. Thank you so much for explaining it KISS style. Because once again, I have to learn machine learning more or less from scratch, but this time for my own personal projects.
BAM! I'm glad my videos are helpful. :)
Imagine a world without this yt channel.
:)
It took me more than 16 minutes (the length of the video) to get what happens, since I had to pause the video to think, but I should say it is very clearly explained! Love your video!!
Hooray! I'm glad the video was helpful. Now that you understand Seq2Seq, I bet you could understand Transformers relatively easily: ua-cam.com/video/zxQyTK8quyY/v-deo.html
I am nearly finished with the ML playlist, and I must say that I am extremely satisfied with the content and I do regret spending so much time on other videos and courses before finding your channel. Thank you for creating this invaluable resource. I hope you continue to create videos on more complex subjects, particularly in the medical field. Once again, thank you very much.
Thank you very much!
I can't thank you enough for these tutorials on NLP. From the first tutorial related to RNNs to this tutorial, you explained so concisely and clearly notions that I had struggled with and was scared to tackle for a couple of weeks, due to the amount of papers/tutorials someone should read/watch in order to be up to date with the most recent advancements in NLP/ASR. You jump-started my journey and made it much more pleasant! Thank you so much!
Glad I could help!
Another great explanation!
It is so comforting to know that whatever I don't understand in class, I can always find a video in your channel and be confident that I will understand by the end.
Thank you!
Glad it was helpful!
This is amazing. Can't wait for the Transformers tutorial to be released.
Thanks!
An awesome video as always! Super excited for videos on attention, transformers and LLM. In the era of AI and ChatGPT, these are going to go viral, making this knowledge accessible to more people, explained in a much simpler manner.
Thanks!
Wonderful tutorial! Studying on StatQuest is really like a recursive process. I first search for transformers, then follow the links below all the way to RNNs, and finally study backward all the way to the top! That is a really good learning experience, thanks!
Hooray! I'm glad these videos are helpful. By the way, here's the link to the transformers video: ua-cam.com/video/zxQyTK8quyY/v-deo.html
I've got the finals of my final course tomorrow, on the final day of my undergraduate journey, and you posted this just a few hours ago.. that's a triple final bam for me
Good luck! :)
exact same situation bro
I just wanted to mention that I really love and appreciate you as well as your content. You have been an incredible inspiration for me and my friends to found our own start-up in the realm of AI without any prior knowledge. Through your videos I was able to get a basic overview of most of the important topics and to do my own research according to those outlines. So regardless of whether the start-up fails or not, I am still grateful for you, and I guess the implications that I got out of your videos led to a path that will forever change my life. So thanks❤
BAM! And good luck with the start up!!!
Coming from the video about LSTMs. Again, the explanation is so smooth. Everything is perfectly discussed. I find it immensely useful for refreshing my knowledge base. Respect!
Glad it was helpful!
I genuinely love you for these videos holy smokes
BAM! :)
Thanks, Saved my Natural Language Processing Exam
BAM!
Been waiting for this for so long. ❤. Thank you Josh.
Hooray! :)
Love your videos Josh! Thanks for sharing all your knowledge in such a concise way.
Thank you! :)
Amazing! Can't wait to check out the Self-Attention and Transformers 'Quests!
Thanks! :)
Absolutely amazing as always, thank you so much. Can't wait for attention and transformers lessons, it will again help me so much for my current internship !
bam!
Incredible, Josh. This is exactly what I needed right now!
BAM! :)
I like how your videos backpropagate so I have to watch all of them if I want to understand one.
Ideally it would be nice to just have one video that explains all the details, but I think for this specific topic it was pretty important to split the individual bits into separate videos since they can all stand on their own.
@@statquest Yeah it's a fair point and thank you so much for making these videos.
Omg I’m sooooooo happy that you are making videos on this!!! Had heard about it a lot but never figured it out until today 😂 Cannot wait for the ones on attention and transformers ❤ Again thank you for making these awesome videos, they really helped me A LOT
Thank you very much! :)
BAM!! Loved the quest, as always!
Thank you so much! :)
Hey, thanks for your awesome work in explaining these complex concepts concisely and clearly! However, I did have some confusion after watching this video for the first time (I cleared them by watching it several times) and wanted to share these notes with you since I think they could potentially make the video even better:
1. The "ir vamos y " tokens in the decoding layer are a bit misleading in two ways:
a. I thought "ir" and "y" stood for the "¡" and "!" in "¡Vamos!" Thus, I was expecting the first output from the decoding layer to be "ir" instead of "vamos."
b. The position of the "<EOS>" token is also a bit misleading because I thought it was the end-of-sentence token for "¡Vamos!" and wondered why we would start from the end of the sentence. I think "<EOS> ir vamos y" would have been easier to follow and would cause less confusion.
2. [6:20] One silly question I had at this part was, "Is each value of the 2-D embedding used as an input for each LSTM cell, or are the two values used twice as inputs for two cells?" Since 2 and 2 are such a great match, lol.
3. One important aspect that is missing, IMO, in several videos is how the training stage is done. Based on my understanding, what's explained in this video is the inference stage. I think training is also very worth explaining (basically how the networks learn the weights and biases in a certain model structure design).
4. Another tip is that I felt as the topic gets more complicated, it's worth making the video longer too. 16 minutes for this topic felt a little short for me.
Anyways, this is still one of the best tutorial videos I've watched. Thank you for your effort!!
Sorry you had trouble with this video, but I'm glad you were able to finally figure things out. To answer your question, the 2 embedding values are used for both LSTMs in the first layer (in other words, both LSTMs in the first layer get the exact same input values). If you understand the basics of backpropagation ( ua-cam.com/video/IN2XmBhILt4/v-deo.html ), then really all you need to know about how this model is trained is how "teacher-forcing" is used. Other than that, there's no difference from a normal Neural Network. That said, I also plan on creating a video where we code this exact network in PyTorch, and in that video I'll show how this specific model is trained.
Can't wait to learn the coding part from you too. And thanks for your patient reply to every comment. It's amazing. @@statquest
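For anyone wondering what teacher forcing looks like in code, here is a minimal PyTorch sketch, assuming a toy 2-layer, 2-unit encoder-decoder; all of the names (embed_in, embed_out, encoder, decoder, fc, target_ids) are hypothetical and this is not the exact code from the video:

import torch
import torch.nn as nn

embed_in = nn.Embedding(num_embeddings=4, embedding_dim=2)    # input (English) token embeddings
embed_out = nn.Embedding(num_embeddings=4, embedding_dim=2)   # output (Spanish) token embeddings
encoder = nn.LSTM(input_size=2, hidden_size=2, num_layers=2)
decoder = nn.LSTM(input_size=2, hidden_size=2, num_layers=2)
fc = nn.Linear(2, 4)                                          # scores for each output token
loss_fn = nn.CrossEntropyLoss()

def training_step(input_ids, target_ids):
    # Encode the whole input phrase to get the context vector (h, c).
    _, (h, c) = encoder(embed_in(input_ids).unsqueeze(1))

    # Teacher forcing: the decoder's input at step t is the *known* target token
    # from step t-1 (starting with <EOS>), not whatever the decoder just predicted.
    decoder_inputs = target_ids[:-1]       # e.g. [<EOS>, vamos]
    decoder_targets = target_ids[1:]       # e.g. [vamos, <EOS>]

    out, _ = decoder(embed_out(decoder_inputs).unsqueeze(1), (h, c))
    logits = fc(out.squeeze(1))            # one row of token scores per output step
    return loss_fn(logits, decoder_targets)    # then backpropagate as usual

# e.g. loss = training_step(torch.tensor([0, 1]), torch.tensor([3, 2, 3]))  # made-up token IDs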
14:34 seems like a painful training, but one that, combined with great compassion for other students, led you to produce these marvels of educational material!
Thank you!
Wow man, triple bam indeed, the concept is crystal clear to me now !
Thanks!
Great video! thanks for producing such a high quality, clear and yet simple tutorial
Thank you!
Another amazing video, and I cannot thank you enough for helping us understand neural networks in such a friendly way!
At 4:48, you mentioned "because the vocabulary contains a mix of words and symbols, we refer to the individual elements in a vocabulary as tokens". I wonder if this applies to models like GPT when it comes to limits on the context length (e.g., GPT-3.5, 4096 tokens) or controlling the output token size.
Yes, GPT models are based on tokens, however, tokens are usually word fragments, rather than whole words. That's why each word counts as more than one token.
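As a quick illustration of word fragments, here is a sketch using the tiktoken library (not something used in the video, and the exact token IDs you get depend on the encoding):

import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.get_encoding("cl100k_base")        # encoding used by GPT-3.5/4-era models

ids = enc.encode("Unbelievable translations")
print(ids)                                        # a list of integer token IDs
print([enc.decode([i]) for i in ids])             # each ID maps back to a word fragment,
                                                  # so one word can cost several tokens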
Thank you, now I have an intuition for why the names are encoder and decoder, something I've been curious about for a full year.
bam! :)
See this is the kind of explanation I was waiting for❤
bam!
Been waiting for this from you. Love it.
Thanks!
Hi Josh, I have a question at time stamp 11:54.
Why are we feeding the <EOS> token to the decoder, shouldn't we feed the <SOS> (start of sequence) token to initiate the translation?
Thank you for sharing these world-class tutorials for free :)
Cheers!
You can feed whatever you want into the decoder to get it initialized. I use <EOS> because that is what they used in the original manuscript. But we could have used <SOS>.
Yaas, more on Transformers! Waiting for the StatQuest illustrated book on those topics!
I'm working on it! :)
Thank you! (谢谢!)
TRIPLE BAM!!! Thank you for supporting StatQuest!!! :)
Thanks!
Hooray!!! Thank you so much for supporting StatQuest!!! TRIPLE BAM!!! :)
Once again, we can't appreciate you enough for the fantastic videos!
I'd love a clarification if you don't mind. At 8:44 - 8:48, you mentioned that the decoder has LSTMs which have 2 layers and each layer has 2 cells. But, in the image on the screen, I can only see 1 cell per layer. Is there something I'm missing?
Meanwhile, thanks a lot for replying on your videos. I was honored when you replied promptly to comments on your previous video. Looking forward to your response on this one.
The other LSTMs are there, just hard to see.
Can't wait to see the Stanford parser head structure explained as a step towards attention!
I'll keep that in mind.
This video is awesome, it helped me a lot. Thank you very much.
Thanks!
Best ML vids out there, thanks!
Wow, thanks!
Hi Josh - this one didn't really click for me. There's no 'aha' moment that I get with almost all your videos. I think we need to walk through the maths - or have a follow-up - even if it takes an hour. Perhaps a guest lecturer or willing student (happy to offer my time)... alas, I guess as the algorithms become more complex the less reasonable this becomes. However, you did a masterful job simplifying CNNs in a way I've never seen elsewhere, so I'm sure if anyone can do it, you can! Thanks regardless - there's a lot of joy in this community thanks to your teaching.
Yeah - it was a little bit of a bummer that I couldn't do the math all the way through. I'm working on something like that for Transformers and we'll see if I can pull it off. The math might have to be a separate video.
These videos are doing god's work. Nothing even comes close.
Thank you!
You posted this video when I needed it the most. Thanks man, really awesome 👍🏻
HOORAY!!! BAM! :)
this is my homework assignment today. how did youtube know to put this in my feed? maybe the next statquest will explain. 😂
bam! :)
I'm a student who studies in Korea. I love your videos and I appreciate that you made them. Can I ask when the video about 'Transformers' will be uploaded? It'll be a big help for my NLP studies. Thank you.
I'm working on it right now, so it will, hopefully, be out sometime in June.
Using <EOS> as the first input in the decoder to start the whole translation does appear to be magical.
It's essentially a placeholder to get the translation started. You could probably start with anything, as long as you were consistent.
Best channel ever ❤
Thank you! :)
Thank you so much, I learned a lot about LLMs from this video.
Glad to hear that! I also have videos on transformers (which are the foundation of LLMs) here: ua-cam.com/video/zxQyTK8quyY/v-deo.html and ua-cam.com/video/bQ5BoolX9Ag/v-deo.html
This channel is awesome. Thank you
Thanks!
you are an excellent teacher
Thank you! 😃
Thank you Joshhh !!! I really love the way you teach everything
Thank you!
Awesome Video. Please make a video on GAN and BPTT. Request.....
I'll keep those topics in mind.
@@statquest Thank you sir.
Do you need the same number of LSTM cells as there are embedding values?
Technically no. If you have more embedding values, you can add weights to the connections to an LSTM unit and then sum those products to get the desired number of input values. If you have fewer embedding values, you can use extra weights to expand their number.
Hello, thank you for the wonderful tutorial once again. Just a question about the word2vec output of embedding values: I'm a bit confused as to how we can input multiple embedding values from one word into an LSTM. Unrolling doesn't seem to make sense since it's based on one word. If so, do we sum up all these embedding values in another y=x layer, with weights associated with them, in order to get a single value for a single word input?
Or do we use each individual embedding value as input for a different LSTM cell? (Which would mean that we can have 100-1000+ LSTM cells per word.)
When we have multiple inputs to a single LSTM cell, extra connections to each subunit are created with additional weights for the new inputs. So, instead of just one connection from the input to the subunit that controls how much of the long-term memory to remember, we have one connection per input to that same subunit, each with its own weight. Likewise, extra connections are added from the inputs to all of the other subunits.
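A quick way to see those extra per-input weights is PyTorch's built-in LSTMCell (a sketch with example sizes, not the code from the video):

import torch.nn as nn

# An LSTM cell fed 2 inputs (e.g. a 2-dimensional word embedding) that keeps
# 1 long-term memory value and 1 short-term memory value.
cell = nn.LSTMCell(input_size=2, hidden_size=1)

# weight_ih holds the input weights for all 4 subunits (gates):
# shape (4 * hidden_size, input_size) = (4, 2), i.e. each subunit gets one
# weight per input value, just as described above.
print(cell.weight_ih.shape)   # torch.Size([4, 2])
print(cell.weight_hh.shape)   # torch.Size([4, 1]) - weights for the short-term memory
print(cell.bias_ih.shape)     # torch.Size([4])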
awesome vid as always Josh :)
Thank you!
Hi Josh,
Thanks for the much-needed content on encoder-decoder! :)
However, I had a few questions/clarifications in mind:
1) Does the number of cells in each layer within the Encoder or Decoder have to be the same?
2) From the illustration of the model, the information from the second layer of the encoder will only flow to the second layer of the decoder. Is this understanding correct?
3) Building off from 2), does the number of cells from each layer of the Encoder have to be equal to the number of cells from each corresponding layer of the Decoder?
4) Does the number of layers in the decoder & encoder have to be the same?
I think my main problem is trying to visualise the model architecture and how the information flows if there are different numbers of cells/layers. Like, how would an encoder with 3 layers and 2 cells per layer connect to a decoder that perhaps has only 1 layer but 3 cells?
First, the important thing is that there are no rules in neural networks, just conventions. That said, in the original manuscript (and in pretty much every implementation), the number of LSTMs per layer and the number of layers are always equal in the Encoder and the Decoder - this makes it easy for the context vector to connect the two sets of LSTMs. However, if you want to come up with a different strategy, there are no rules that say you can't do it that way - you just have to figure out how to make it work.
Hello! Awesome video, as is everything from this channel, but I have a question: how do you calculate the number of weights and biases in both your network and the original one? If you could break down how you did it, it would be very useful! Thanks!
I'm not sure I understand your question. Are you asking how the weights and biases are trained?
No, in the video, at 15:48, you say that your model has 220 weights and biases. How do you calculate this number?
@@benetramioicomas3785 I wrote the model in PyTorch and then printed out all trainable parameters with a "for" loop that also counted the number of trainable parameters. Specifically, I wrote this loop to print out all of the weights and biases:

for name, param in model.named_parameters():
    print(name, param.data)

To count the number of weights and biases, I used this loop:

total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
Thanks for the video Josh, it’s very clearly explained.
I have a technical question about the Decoder that I might have missed during the video. How can you dynamically change the sequence length fed to the Decoder? In other words, how can you unroll the decoder’s LSTMs? For instance, when you feed the <EOS> token to the (let’s say, already trained) Decoder, and then you get the first output token and feed it together with the <EOS> token, the length of the input sequence to the decoder dynamically grows from 1 (<EOS>) to 2 (<EOS> + output). The architecture of the NN cannot change, so I’m unsure how to implement this.
Cheers! 👍🏻👍🏻
When using the Encoder-Decoder for translation, you pass the tokens (or words) to the decoder one at a time. So we start by passing <EOS> to the decoder and it predicts "vamos". So then we pass "vamos" (not <EOS> + vamos) to the same decoder and repeat, passing one token to the decoder at a time until we get <EOS>.
@@statquest Thanks for the reply. I see your point. Do you iterate then on the whole Encoder-Decoder model or just on the Decoder? In other words, is the input to the model Let’s + go + <EOS> in the first iteration? Or do we just run the Encoder once to get the context vector and iterate over the Decoder, so that the input is just one word at a time (starting with <EOS>)? In this last case, I assume we have to update the cell and hidden states for each new word we input to the Decoder.
@@101alexmartin In this case, we have to calculate the values for input one word at a time, just like for the output - this is because the Long and Short Term memories have to be updated by each word sequentially. As you might imagine, this is a little bit of a computational bottleneck. And this bottleneck was one of the motivations for Transformers, which you can learn about here: ua-cam.com/video/zxQyTK8quyY/v-deo.html and here: ua-cam.com/video/bQ5BoolX9Ag/v-deo.html (NOTE: you might also want to watch this video on attention first: ua-cam.com/video/PSs6nxngL6k/v-deo.html )
@@statquest thanks for your reply. What do you mean by calculating the values for the input one word at a time? Do you mean that the input to the model in the first iteration would be [Let’s, go, EOS] and for the second iteration it would be [Let’s, go, vamos]? Or do you mean that you only use the Encoder once, to get the context vector output when you input [Let’s, go], and then you just focus on the Decoder, initializing it with the Encoder context vector in the first iteration, and then iterating over the Decoder (i.e over a LSTM architecture built for an input sequence length of 1), using the cell and hidden states of previous iterations to initialize the LSTM, until you get [EOS] as output?
@@101alexmartin What I mean is that we start by calculating the context vector (the long and short term memories) for "let's". Then we plug those values into the unrolled LSTMs that we use for "go", and keep doing that, calculating the context vector one word at a time, until we get to the end of the input. Watching the video on Transformers may help you understand the distinction that I'm making here between doing things sequentially vs. in parallel.
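To make the "one word at a time" idea concrete, here is a minimal sketch of the encoder side with a single PyTorch LSTMCell and 2-dimensional embeddings (hypothetical names and sizes, not the exact network from the video):

import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=4, embedding_dim=2)   # tiny input vocabulary
cell = nn.LSTMCell(input_size=2, hidden_size=2)

def encode(input_ids):
    # The short-term (h) and long-term (c) memories start at zero and are
    # updated one token at a time - this is the sequential bottleneck.
    h = torch.zeros(1, 2)
    c = torch.zeros(1, 2)
    for token_id in input_ids:                    # e.g. "let's", then "go"
        x = embed(token_id).view(1, -1)           # (1, embedding_dim)
        h, c = cell(x, (h, c))
    return h, c                                   # the context vector

context = encode(torch.tensor([0, 1]))            # hypothetical IDs for "let's" and "go"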
you are my NEW GOD 😇😶🌫
:)
Thank you Professor Josh, now I understand the workings of Seq2Seq models completely. If possible, can you make a Python-based coding video in either Keras or PyTorch so that we can follow it completely through code? Thanks once again Professor Josh!
I'm working on the PyTorch Lightning videos right now.
Thanks@@statquest
This is pretty cool!
Thanks!
We are getting to Transformers. LEETS GOOO
:)
Vamosssss. 😂
love your songs so much
Thank you! :)
You are amazing TRIPLEBAAAAAMMMM
Thanks!
Can't wait for the Transformers video.
Me too. I'm working on it right now.
yoooo Lets goooooo , Josh posted !
bam! :)
Hi, 9:00 - does the decoder connect to the encoder 1-on-1?
Or do we have to connect each decoder output to each encoder input in an all-to-all, fully connected fashion?
The connections are the exact same as they are within the encoder when we unroll the LSTMs - the long-term memories (cell states) that come out of one LSTM are connected to the long-term memories of the next LSTM - the short-term memories (hidden states) that come out of one LSTM are connected to the short-term memories of the next LSTM.
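In PyTorch terms, that handoff just means passing the encoder's final hidden and cell states in as the decoder's initial states; a sketch assuming both sides have the same number of layers and units (illustrative sizes only):

import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=2, hidden_size=2, num_layers=2)
decoder = nn.LSTM(input_size=2, hidden_size=2, num_layers=2)

src = torch.randn(2, 1, 2)                  # (2 input tokens, batch of 1, 2-D embeddings)
_, (h, c) = encoder(src)                    # h and c: (num_layers=2, batch=1, hidden=2)

# Layer 1 of the encoder initializes layer 1 of the decoder, layer 2 initializes
# layer 2, and the short-term (h) and long-term (c) memories stay separate.
start = torch.randn(1, 1, 2)                # embedding of the <EOS> start token
out, (h, c) = decoder(start, (h, c))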
This video is very helpful... BAM!
Thank you!
Wow. Splendid!..
Thank you! :)
Can someone explain to me more thoroughly the purpose of the multiple layers with multiple LSTM cells in the encoder-decoder model for seq2seq problems? I didn't understand it too well from the video, as the explanation was too vague. But still, it's a great video 👍
We use multiple layers and multiple LSTMs so that we can have more parameters to fit the model to the data. The more parameters we have, the more complicated a dataset we can train the model on.
Hey Josh, thank you so much for this video. It really helps in understanding the concept behind the workings of the encoder and decoder. I have a question though. Here we translated "Let's go" into "Vamos", which came as a result of the softmax function in the last dense layer. What if the phrase we want to translate is shorter than its Spanish translation? What I mean is, what if we have 1 word in English which translates to more than 1 word in Spanish? How will the softmax function give us the result then? It might be a silly question, but as you say, Always Be Curious (ABC).
You just keep unrolling the decoder to any length needed.
What is the logic behind using multiple layers and multiple cells in each layer?
The more layers and cells we use, the more weights and biases we have to fit the model to the data.
Really well explained! Thnx! :D
Thank you!
Please do a series on time series forecasting with Fourier components (short-time Fourier transform) and how to combine multiple frame-length STFT outputs into a single inversion call (wavelets?)
I'll keep that in mind, but I might not be able to get to it soon.
Great explanation!
Thanks!
Oops 🙊 What is *Seq2Seq*? I must go watch *Long Short-Term Memory*. I think I will also have to check out the quest on *Word Embedding and Word2Vec*… and then I will be happy to come back to learn with Josh 😅 I can't wait to learn *Attention for Neural Networks* _Clearly Explained_
This channel is great. I have loved the series so far, thank you very much!
I have a question:
Why do we need a second layer for the encoder and decoder? Could I have achieved the same result using only 1 layer?
Yes. I just wanted to show how the layers worked.
Great lecture Josh!!! What is the significance of using multiple LSTM cells since we already have multiple embeddings for each word?
TIA
The word embeddings tell us about the individual words. The LSTM cells tell us how the words are related to each other - they capture the context.
Your videos are really amazing... ❤ Can you make a video on Boltzmann machines?
I'll keep that in mind.
I liked it a lot. thanks ❤
Thank you! :)
1:57 5:25 5:48 5:56❗ 6:07 6:12 6:41 8:08 8:20 8:30
bam!
I didn't understand the part where we use 2 LSTM cells per layer. Since the input to these cells is the same and we are training them the same way, why would the weight parameters be any different? Please correct me if I'm wrong.
The parameters would be different because they started with different random initial values.
Great thanks for the reply, means a lot.
😀😀😀Love it
Double thanks! :)
Hey... Hope u r doing good.....
So u are about to reach MEGA BAMMMMM
Yes! I can't wait! :)
Hello Sir, I was going through your stats videos (QQ plots, distributions, etc.) and loved your content. I would be really grateful if you could make something regarding a worm plot. Nothing comes up on YouTube when I search for it.
I'll keep that in mind.
Thanks for the invaluable content.
I'm confused about one thing: do the "cell state" and "hidden state" memories that we're dealing with consist of scalars or vectors (you express them by numbers while some other sources by vectors)?
And in your example above, how many LSTM units have you used? 2 units (1 per layer)? I heard that the size of the vector is equal to the size of the LSTM unit.
Each LSTM outputs a value for the cell state and a value for the hidden state. These are scalars that can be concatenated into a vector if you have multiple LSTMs. In this example, we have 4 LSTMs, 2 layers of 2 units each.
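A sketch of how those scalars stack up in PyTorch, using the same 2-layers-of-2-units setup as the example (the code itself is only illustrative):

import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=2, hidden_size=2, num_layers=2)   # 2 layers x 2 units

phrase = torch.randn(2, 1, 2)        # 2 tokens, batch of 1, 2-D word embeddings
_, (h, c) = encoder(phrase)

print(h.shape)                       # torch.Size([2, 1, 2]) - one short-term memory per unit per layer
print(c.shape)                       # torch.Size([2, 1, 2]) - one long-term memory per unit per layer
print(h.numel() + c.numel())         # 8 numbers in total make up the context vector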
Love your vids
Thanks!
Thank you, Josh. You are amazing.
Would you please teach Graph Neural Networks?
I'll keep that in mind.
Dr. Starmer, thanks for the video. I had a doubt about this one. While I could understand the training cycle of the model, I'm not quite sure how inference testing is done, because during inference there won't be any target tokens to be fed into the decoder side of the model, so how would it come up with a response?
If I have to keep it crisp: I couldn't understand how the architecture distinguishes training from inference. Is there some signal passed into the decoder side of the model?
For inference, we provide the context vector from the encoder and provide a start token (<EOS>) to the decoder, and then, based on that, the decoder creates an output token. If that token is <EOS>, it's done; otherwise it takes that token as input to the decoder again, etc...
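A minimal sketch of that inference loop (hypothetical names; it assumes the embed/decoder/fc pieces of an already-trained model and that token ID 3 stands for <EOS>):

import torch

EOS_ID = 3           # hypothetical ID of the <EOS> token
MAX_LEN = 10         # safety limit so the loop always stops

def translate(h, c, embed, decoder, fc):
    # h, c is the context vector produced by the encoder.
    token = torch.tensor([EOS_ID])           # the decoder is kick-started with <EOS>
    output_ids = []
    for _ in range(MAX_LEN):
        x = embed(token).unsqueeze(0)        # (1, 1, embedding_dim)
        out, (h, c) = decoder(x, (h, c))
        token = fc(out[-1]).argmax(dim=-1)   # pick the most likely next token
        if token.item() == EOS_ID:           # predicting <EOS> means we are done
            break
        output_ids.append(token.item())      # the prediction becomes the next input
    return output_ids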
Hey Josh, thanks for another great video! I am not quite sure how the second cell works within one layer though. Is it similar to adding another node within the same layer, as in a vanilla neural network model, where the two cells' outputs are weighted and summed up? Or is it a different concept?
The second cell works independently, and its outputs are also independent. All they share are the same inputs. The fully connected layer at the end merges everything together.
@@statquest Thanks Josh. I have another question: what is the benefit of using the LSTM for the encoder? My understanding of LSTMs is that they can predict a value based on a series of historical values that are related to each other, such that the long/short term memories keep being refreshed. However, in the seq2seq case, "let's" and "go" don't seem to be sequentially related. So why still add them as input in a sequential way (via the unrolled network), rather than input these two words' embeddings into two parallel networks?
In this full network, where do we tell it to convert an English word to a Spanish word? For example, in the LSTM, or in the neural network before the SoftMax function?
The whole thing does the job. There is no single part that does the translation.
What is the activation function used in the output fully connected layer (between the final short-term memories and the inputs to the Softmax)? Is it an identity activation gate? I see in various documentations "linear", "affine", etc.
In this case I used the identity function.
Great explanation, love it!
PS do you have a suggestion for where I can learn to work with seq2seq with tensorflow?
Unfortunately I don't. :(
Can you share the code if you find out how to work with seq2seq in TensorFlow, please?
perfect as usual🦾
Thank you!
Hello Josh !!😊
Hello! :)
Hi Josh. Are the 2 embeddings added up before they go as input to the LSTM?
They are multiplied by individual weights then summed and then a bias is added. The weights and bias are trained with backpropagation.
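In other words, each subunit of the LSTM sees something like this (made-up numbers, just to illustrate the weighted sum):

e1, e2 = 0.50, -1.20      # the word's two embedding values (made up)
w1, w2 = 1.10, 0.30       # one trainable weight per embedding value
b = -0.40                 # a trainable bias

subunit_input = w1 * e1 + w2 * e2 + b   # this sum then goes through the subunit's activation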
First of all, thank you so much for the clear explanation!
I was confused when you said in the decoder during training that the next word we will give to the LSTM is not the predicted word, but we will use the word in training data. How will you let the network know whether the predicted token is correct?
I'm working on a video on how to code and train these networks that will help make this clear. In the meantime, know that we just compare all of the predicted output values to what we know should be the output values.
@@statquest thank you so much!
300 million bams! ❤
Thank you!
I loved the video. However, I have a few questions. The paper says "Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM’s performance markedly". But in this video you have reversed the order of the target sequence, right? Also, how can the outputs from the first LSTM layer in the encoder be directly connected to the first LSTM layer of the decoder if we have a stack of two LSTM layers in the encoder part?
In this video I didn't reverse anything - it might look like the decoder is doing things in reverse because the decoder is initialized with the <EOS> token, but if you look at the outputs, you'll see that the output is not reversed. And, in this example, both the encoder and the decoder have the same number of LSTMs and stacks of LSTMs, so things can be connected directly. If we had a different number, we could use a simple "fully connected layer" to change the number of outputs from the encoder LSTMs to match the inputs to the decoder LSTMs.
@@statquest Thank you so much for replying. I would really appreciate it if you could confirm if the context vector of this simplified model uses 8 numbers? because we have 2 layers with 2 LSTM cells and each cell would have 2 final states (a final long-term memory (c) and a final short-term memory (h))? also, is the final state of first LSTM in the first layer of encoder used to initialize c and h of the first LSTM in the first layer of the decoder? I mean each LSTM cell transfers its c and h to the corresponding cell in the decoder?
@@sukhsehajkaur1731 Yes. And, if you look at the illustration in the video, you'll see 8 lines going from the encoder to the decoder. And yes. This is also illustrated in the video.
@@statquest Thanks a lot for the confirmation. Kudos to your hard work!
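For the mismatched case Josh mentions, here is a sketch of using a fully connected layer to resize the context vector (hypothetical sizes; the original paper keeps both sides the same, so this is not from the paper):

import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=2, hidden_size=2, num_layers=2)   # 2 layers x 2 units
decoder = nn.LSTM(input_size=2, hidden_size=3, num_layers=1)   # 1 layer x 3 units

# Fully connected layers map the encoder's 4 short-term and 4 long-term memory
# values onto the 3 of each that this decoder expects.
resize_h = nn.Linear(4, 3)
resize_c = nn.Linear(4, 3)

_, (h, c) = encoder(torch.randn(2, 1, 2))            # h, c: (2, 1, 2) -> 4 values each
h0 = resize_h(h.reshape(1, -1)).reshape(1, 1, 3)     # (num_layers=1, batch=1, hidden=3)
c0 = resize_c(c.reshape(1, -1)).reshape(1, 1, 3)
out, _ = decoder(torch.randn(1, 1, 2), (h0, c0))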