Sequence-to-Sequence (seq2seq) Encoder-Decoder Neural Networks, Clearly Explained!!!

  • Published Jun 4, 2024
  • In this video, we introduce the basics of how Neural Networks translate one language, like English, to another, like Spanish. The idea is to convert one sequence of things into another sequence of things, and thus, this type of neural network can be applied to all sorts of problems, including translating amino acids into 3-dimensional structures.
    NOTE: This StatQuest assumes that you are already familiar with...
    Long, Short-Term Memory (LSTM): • Long Short-Term Memory...
    ...and...
    Word Embedding: • Word Embedding and Wor...
    Also, if you'd like to go through Ben Trevett's tutorials, see: github.com/bentrevett/pytorch...
    Finally, here's a link to the original manuscript: arxiv.org/abs/1409.3215
    If you'd like to support StatQuest, please consider...
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...buying my book, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
    statquest.org/statquest-store/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    3:43 Building the Encoder
    8:27 Building the Decoder
    12:58 Training The Encoder-Decoder Model
    14:40 My model vs the model from the original manuscript
    #StatQuest #seq2seq #neuralnetwork

COMMENTS • 312

  • @statquest
    @statquest  1 year ago +8

    To learn more about Lightning: lightning.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

    • @graedy2
      @graedy2 8 days ago +1

      One of the best channels on youtube! Wanted to provide some constructive criticism: Either I am blind or you have forgotten to link the og paper you show in the video in the video description.

    • @statquest
      @statquest  8 days ago +1

      @@graedy2 Here it is: arxiv.org/abs/1409.3215

  • @tornadospin9
    @tornadospin9 1 year ago +111

    This channel is like the Khan Academy of neural networks, machine learning, and statistics. Truly remarkable explanations

    • @statquest
      @statquest  1 year ago +6

      Thank you!

    • @eliaborras9834
      @eliaborras9834 5 months ago +3

      it's way better :) khan Academy does not have such cool songs =:)

  • @reinerheiner1148
    @reinerheiner1148 1 year ago +43

    This channel is gold. I remember how, for my first coding job, where I had no programming knowledge (lol) but had no choice than to take it anyways, I quickly had to learn php and mysql. To get myself started, I searched for the simplest php coding books and then got myself two books from the php & mysql for kids series, even though I was already in my mid twenties. Long story short, I quickly learned the basics, and did code for a living. Complex topics don't have to be complex, in fact they are always built on building blocks of simple concepts and can be explained and taught as such IMHO. Thank you so much for explaining it KISS style. Because once again, I have to learn machine learning more or less from scratch, but this time for my own personal projects.

    • @statquest
      @statquest  1 year ago +3

      BAM! I'm glad my videos are helpful. :)

  • @cat-a-lyst
    @cat-a-lyst 11 months ago +11

    I literally searched everywhere and finally came across your channel. Seems like gradient descent worked fine.

  • @gabip265
    @gabip265 11 months ago +9

    I can't thank you enough for these tutorials on NLP. From the first tutorial related to RNNs to this tutorial, you explained so concisely and clearly notions that I had struggled with and was scared to tackle for a couple of weeks, due to the amount of papers/tutorials someone should read/watch in order to be up to date with the most recent advancements in NLP/ASR. You jump-started my journey and made it much more pleasant! Thank you so much!

    • @statquest
      @statquest  11 months ago

      Glad I could help!

  • @diamondstep3957
    @diamondstep3957 1 year ago +2

    Love your videos Josh! Thanks for sharing all your knowledge in such a concise way.

  • @rachit7185
    @rachit7185 1 year ago +14

    An awesome video as always! Super excited for videos on attention, transformers and LLM. In the era of AI and ChatGPT, these are going to go viral, making this knowledge accessible to more people, explained in a much simpler manner.

  • @paulk6900
    @paulk6900 1 year ago +4

    I just wanted to mention that I really love and appreciate you as well as your content. You have been an incredible inspiration for me and my friends to found our own start up in the realm of AI without any prior knowledge. Through your videos I was able to get a basic overview of most of the important topics and to do my own research according to those outlines. So regardless of whether the start up fails or not, I am still grateful to you, and I guess the implications that I got out of your videos led to a path that will forever change my life. So thanks❤

    • @statquest
      @statquest  1 year ago

      BAM! And good luck with the start up!!!

  • @m.taufiqaffandi
    @m.taufiqaffandi 1 year ago +8

    This is amazing. Can't wait for the Transformers tutorial to be released.

  • @juliali3081
    @juliali3081 6 months ago +4

    It took me more than 16 minutes (the length of the video) to get what happens since I have to pause the video to think, but I should say it is very clearly explained! Love your video!!

    • @statquest
      @statquest  6 months ago

      Hooray! I'm glad the video was helpful. Now that you understand Seq2Seq, I bet you could understand Transformers relatively easily: ua-cam.com/video/zxQyTK8quyY/v-deo.html

  • @mateuszsmendowski2677
    @mateuszsmendowski2677 11 months ago +1

    Coming from the video about LSTMs. Again, the explanation is so smooth. Everything is perfectly discussed. I find it immensely useful to refresh my knowledge base. Respect!

    • @statquest
      @statquest  11 months ago +1

      Glad it was helpful!

  • @ligezhang4735
    @ligezhang4735 10 months ago +1

    Wonderful tutorial! Studying on Statquest is really like a recursive process. I first search for transformers, then follow the links below all the way to RNN, and finally study backward all the way to the top! That is a really good learning experience thanks!

    • @statquest
      @statquest  10 months ago

      Hooray! I'm glad these videos are helpful. By the way, here's the link to the transformers video: ua-cam.com/video/zxQyTK8quyY/v-deo.html

  • @sheiphanshaijan1249
    @sheiphanshaijan1249 1 year ago +3

    Been waiting for this for so long. ❤. Thank you Josh.

  • @AI_Financier
    @AI_Financier 1 year ago +1

    Great video! thanks for producing such a high quality, clear and yet simple tutorial

  • @ZinzinsIA
    @ZinzinsIA 1 year ago +1

    Absolutely amazing as always, thank you so much. Can't wait for attention and transformers lessons, it will again help me so much for my current internship !

  • @MCMelonslice
    @MCMelonslice 1 year ago +2

    Incredible, Josh. This is exactly what I needed right now!

  • @KR-fy3ls
    @KR-fy3ls 1 year ago +2

    Been waiting for this from you. Love it.

  • @shafiullm
    @shafiullm 1 year ago +2

    I got my finals of my final course in my final day tomorrow of my undergraduate journey and you posted this exactly few hours ago.. thats a triple final bam for me

  • @fancytoadette
    @fancytoadette 1 year ago +2

    Omg I’m sooooooo happy that you are making videos on this!!! I have heard about it a lot but never figured it out until today 😂 cannot wait for the ones on attention and transformers ❤ Again thank you for making these awesome videos they really helped me A LOT

  • @user-te7tu7tk8f
    @user-te7tu7tk8f 1 month ago +1

    Thank you, now I have an intuition for why they are named encoder and decoder, which I've been curious about for a full year.

  • @Er1kth3b00s
    @Er1kth3b00s 1 year ago +1

    Amazing! Can't wait to check out the Self-Attention and Transformers 'Quests!

  • @enestemel9490
    @enestemel9490 1 year ago +1

    Thank you Joshhh !!! I really love the way you teach everything

  • @bibhutibaibhavbora8770
    @bibhutibaibhavbora8770 7 months ago +1

    See this is the kind of explanation I was waiting for❤

  • @juaneshberger9567
    @juaneshberger9567 10 months ago +1

    Best ML vids out there, thanks!

  • @utkarshujwal3286
    @utkarshujwal3286 4 months ago

    Dr. Starmer, thanks for the video. I had a doubt about this one. While I could understand the training cycle of the model, I am not quite sure how inference is done, because during inference there won't be any tokens to feed into the decoder side of the model, so how would it come up with a response?
    To keep it crisp: I couldn't understand how the architecture distinguishes training from inference. Is there some signal passed into the decoder side of the model?

    • @statquest
      @statquest  4 months ago

      For inference, we provide the context vector from the encoder and a start token to the decoder, and then, based on that, the decoder creates an output token. If that token is <EOS>, it's done; otherwise, that token is fed back into the decoder as input, etc...
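
A minimal sketch of the inference loop described in this reply. It assumes, as in other replies in this thread, that `<EOS>` doubles as the start token; `toy_step` is a hypothetical lookup table standing in for a trained decoder, and `greedy_decode` and `max_len` are illustrative names rather than anything from the video.

```python
from typing import Callable, List, Tuple

EOS = "<EOS>"  # assumption: the same token starts and stops decoding

State = Tuple[float, ...]


def greedy_decode(
    step: Callable[[str, State], Tuple[str, State]],
    context: State,
    start_token: str = EOS,
    max_len: int = 10,
) -> List[str]:
    """Feed each predicted token back into the decoder until <EOS> appears."""
    tokens: List[str] = []
    token, state = start_token, context
    for _ in range(max_len):  # safety cap in case <EOS> is never produced
        token, state = step(token, state)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def toy_step(token: str, state: State) -> Tuple[str, State]:
    """Hypothetical stand-in for a trained decoder step (no LSTMs involved)."""
    table = {EOS: "vamos", "vamos": EOS}
    return table[token], state


# The context vector would normally come from the encoder; zeros are a placeholder.
result = greedy_decode(toy_step, context=(0.0,) * 8)
print(result)
```

Nothing in the loop depends on the target sentence, which is the difference from training: the only inputs are the context vector and the decoder's own previous output.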

  • @ririnch7408
    @ririnch7408 11 months ago +2

    Hello, thank you for the wonderful tutorial once again. Just a question about the word2vec output of embedding values: I'm a bit confused as to how we can feed multiple embedding values from one word into the LSTM input. Unrolling it doesn't seem to make sense since it's based on one word. If so, do we sum up all these embedding values in another layer of y=x, with weights associated with them, in order to get a single value for a single word input?

    • @ririnch7408
      @ririnch7408 11 months ago

      Or do we use each individual embedding value as input for different LSTM cell? (Which would mean that we can have 100-1000+ LSTM cells per word)

    • @statquest
      @statquest  11 months ago

      When we have multiple inputs to a single LSTM cell, extra connections to each subunit are created with additional weights for the new inputs. So, instead of just one connection from the input to the subunit that controls how much of the long-term memory to remember, we have one connection per input to that same subunit, each with its own weight. Likewise, extra connections are added from the inputs to all of the other subunits.
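
A rough sketch of the idea in this reply, shown for the subunit that decides how much long-term memory to keep. Each embedding value gets its own weight before everything is summed with the weighted short-term memory and a bias. All numbers and the function name `forget_gate` are made up for illustration; they are not the trained values from the video.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forget_gate(
    inputs: Sequence[float],
    input_weights: Sequence[float],
    short_term: float,
    short_weight: float,
    bias: float,
) -> float:
    """Fraction of the long-term memory to keep, given several input values."""
    if len(inputs) != len(input_weights):
        raise ValueError("each input needs its own weight")
    # One connection (weight) per input value, all feeding the same subunit.
    weighted_inputs = sum(w * x for w, x in zip(input_weights, inputs))
    return sigmoid(weighted_inputs + short_weight * short_term + bias)


embedding = [1.5, -0.4]  # hypothetical 2-value embedding for one token
keep = forget_gate(
    embedding,
    input_weights=[0.8, -1.2],
    short_term=0.0,
    short_weight=0.5,
    bias=0.1,
)
print(round(keep, 3))
```

The other subunits would follow the same pattern, each with its own set of per-input weights.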

  • @timmygilbert4102
    @timmygilbert4102 1 year ago

    Can't wait to see the stanford parser head structure explained as a step towards attention!

  • @yasharzargari4360
    @yasharzargari4360 6 days ago +1

    This channel is awesome. Thank you

  • @cat-a-lyst
    @cat-a-lyst 11 months ago +1

    you are an excellent teacher

    • @statquest
      @statquest  11 months ago

      Thank you! 😃

  • @ygbr2997
    @ygbr2997 1 year ago

    Using <EOS> as the first input in the decoder to start the whole translation does appear to be magical

    • @statquest
      @statquest  1 year ago

      It's essentially a placeholder to get the translation started. You could probably start with anything, as long as you were consistent.

  • @sheldonsebastian7232
    @sheldonsebastian7232 1 year ago +1

    Yaas more on Transformers! Waiting for statquest illustrated book on those topics!

  • @GenesisChat
    @GenesisChat 2 months ago +1

    14:34 seems like a painful training, but one that, added to great compassion for other students, led you to produce those marvels of good education materials!

  • @harshilsajan4397
    @harshilsajan4397 5 months ago

    Hi great video!
    Just a question, to give the input to lstm, the input length will be constrained by lstm length right? For example 'let's' in first one and 'go' in second one.

    • @statquest
      @statquest  5 months ago

      I'm not sure what you mean by "lstm length". The idea here is that we can just copy the same sets of LSTMs as many times as we need to handle inputs of different lengths.

  • @Sarifmen
    @Sarifmen 1 year ago +3

    We are getting to Transformers. LEETS GOOO

  • @alecrodrigue
    @alecrodrigue 6 months ago +1

    awesome vid as always Josh :)

  • @user-se8ld5nn7o
    @user-se8ld5nn7o 1 month ago

    Another amazing video, and I cannot thank you enough for helping us understand neural networks in such a friendly way!
    At 4:48, you mentioned "because the vocabulary contains a mix of words and symbols, we refer to the individual elements in a vocabulary as tokens". I wonder if this also applies to models like GPT when it's about limits on the context length (e.g., GPT3.5, 4096 tokens) or controlling the output token size.

    • @statquest
      @statquest  1 month ago

      Yes, GPT models are based on tokens. However, tokens are usually word fragments rather than whole words, which is why a single word can count as more than one token.

  • @MariaHendrikx
    @MariaHendrikx 7 months ago +1

    Really well explained! Thnx! :D

  • @Nono-de3zi
    @Nono-de3zi 10 months ago

    What is the activation function used in the output fully connected layer (between the final short-term memories and the inputs to the Softmax)? Is it an identity activation gate? I see in various documentations "linear", "affine", etc.

    • @statquest
      @statquest  10 months ago +2

      In this case I used the identity function.
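
To make the answer concrete, here is a small sketch of a fully connected layer with an identity activation feeding a softmax. With an identity activation, the layer is just a weighted sum plus a bias per output token. The vocabulary follows the tokens mentioned elsewhere in this thread; the weights, biases and hidden values are invented for illustration.

```python
import math
from typing import List, Sequence


def fully_connected(
    hidden: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Weighted sum plus bias per output; identity activation, so no squashing."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["ir", "vamos", "y", "<EOS>"]
hidden = [0.7, -0.2]  # hypothetical short-term memories from the top LSTM layer
weights = [[0.1, 0.3], [2.0, -1.0], [-0.5, 0.4], [0.2, 0.2]]  # made-up values
biases = [0.0, 0.1, 0.0, -0.1]  # made-up values

scores = fully_connected(hidden, weights, biases)
probs = softmax(scores)
best = vocab[probs.index(max(probs))]
print(best)
```

Terms like "linear" or "affine" in library documentation usually describe this same weighted-sum-plus-bias computation with no nonlinearity applied afterwards.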

  • @Xayuap
    @Xayuap 1 year ago

    hi, at 9:00, does the decoder connect to the encoder 1 to 1?
    or do we have to connect each decoder output to each encoder input in an all-to-all, fully connected fashion?

    • @statquest
      @statquest  1 year ago +1

      The connections are the exact same as they are within the encoder when we unroll the LSTMs: the long-term memories (cell states) that come out of one LSTM are connected to the long-term memories of the next LSTM, and the short-term memories (hidden states) that come out of one LSTM are connected to the short-term memories of the next LSTM.

  • @xxxiu13
    @xxxiu13 7 months ago +1

    Great explanation!

  • @advaithsahasranamam6170
    @advaithsahasranamam6170 1 year ago +1

    Great explanation, love it!
    PS do you have a suggestion for where I can learn to work with seq2seq with tensorflow?

  • @coolrohitjha2008
    @coolrohitjha2008 9 months ago +1

    Great lecture Josh!!! What is the significance of using multiple LSTM cells since we already have multiple embeddings for each word?
    TIA

    • @statquest
      @statquest  9 months ago

      The word embeddings tell us about the individual words. The LSTM cells tell us how the words are related to each other - they capture the context.

  • @CelinePhan
    @CelinePhan 1 year ago +1

    love your songs so much

  • @avishkaravishkar1451
    @avishkaravishkar1451 5 months ago

    Hi Josh. Are the 2 embeddings added up before it goes as an input to lstm?

    • @statquest
      @statquest  5 months ago

      They are multiplied by individual weights then summed and then a bias is added. The weights and bias are trained with backpropagation.

  • @ilirhajrullahu4083
    @ilirhajrullahu4083 5 months ago

    This channel is great. I have loved the series so far, thank you very much!
    I have a question:
    Why do we need a second layer for the encoder and decoder? Could I have achieved the same result using only 1 layer?

    • @statquest
      @statquest  5 months ago +1

      Yes. I just wanted to show how the layers worked.

  • @roczhang2009
    @roczhang2009 9 months ago +1

    Hey, thanks for your awesome work in explaining these complex concepts concisely and clearly! However, I did have some confusion after watching this video for the first time (I cleared them by watching it several times) and wanted to share these notes with you since I think they could potentially make the video even better:
    1. The "ir vamos y <EOS>" tokens in the decoding layer are a bit misleading in two ways:
    a. I thought "ir" and "y" stood for the "¡" and "!" in "¡Vamos!" Thus, I was expecting the first output from the decoding layer to be "ir" instead of "vamos."
    b. The position of the "<EOS>" token is also a bit misleading because I thought it was the end-of-sentence token for "¡Vamos!" and wondered why we would start from the end of the sentence. I think "<EOS> ir vamos y" would have been easier to follow and would cause less confusion.
    2. [6:20] One silly question I had at this part was, "Is each value of the 2-D embedding used as an input for each LSTM cell, or are the two values used twice as inputs for two cells?" Since 2 and 2 are such a great match, lol.
    3. One important aspect that is missing, IMO, in several videos is how the training stage is done. Based on my understanding, what's explained in this video is the inference stage. I think training is also very worth explaining (basically how the networks learn the weights and biases in a certain model structure design).
    4. Another tip is that I felt as the topic gets more complicated, it's worth making the video longer too. 16 minutes for this topic felt a little short for me.
    Anyways, this is still one of the best tutorial videos I've watched. Thank you for your effort!!

    • @statquest
      @statquest  9 months ago +1

      Sorry you had trouble with this video, but I'm glad you were able to finally figure things out. To answer your question, the 2 embedding values are used for both LSTMs in the first layer. (in other words, both LSTMs in the first layers get the exact same input values). If you understand the basics of backpropagation ( ua-cam.com/video/IN2XmBhILt4/v-deo.html ), then really all you need to know about how this model is trained is how "teacher-forcing" is used. Other than that, there's no difference from a normal Neural Network. That said, I also plan on creating a video where we code this exact network in PyTorch and in that video I'll show how this specific model is trained.
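
Since teacher forcing is named here as the one training detail that differs from a normal neural network, here is a small sketch of what it means for the decoder's inputs. During training, each decoder step is given the known correct previous token rather than whatever the decoder itself predicted. The sketch assumes `<EOS>` starts decoding, as other replies in this thread describe; `teacher_forcing_pairs` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

EOS = "<EOS>"  # assumption: used both to start decoding and to mark the end


def teacher_forcing_pairs(target_tokens: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (decoder input, expected output) pairs for one training phrase.

    The input at each step is the correct previous token, regardless of
    what the decoder actually predicted at that step.
    """
    decoder_inputs = [EOS] + list(target_tokens)
    expected_outputs = list(target_tokens) + [EOS]
    return list(zip(decoder_inputs, expected_outputs))


pairs = teacher_forcing_pairs(["vamos"])
for decoder_input, expected in pairs:
    print(f"input={decoder_input!r} -> expected={expected!r}")
```

The loss would then compare the decoder's softmax output at each step with the expected token, and backpropagation proceeds as usual.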

    • @roczhang2009
      @roczhang2009 9 months ago +1

      Can't wait to learn the coding part from you too. And thanks for your patient reply to every comment. It's amazing. @@statquest

  • @Foba_Bett
    @Foba_Bett 3 months ago +1

    These videos are doing god's work. Nothing even comes close.

  • @WeightsByDev
    @WeightsByDev 1 month ago +1

    This video is very helpful... BAM!

  • @baocaohoang3444
    @baocaohoang3444 11 months ago +1

    Best channel ever ❤

  • @khaikit1232
    @khaikit1232 1 year ago +1

    Hi Josh,
    Thanks for the much-needed content on encoder-decoder! :)
    However, I had a few questions/clarifications in mind:
    1) Does the number of cells in each layer within the Encoder or Decoder have to be the same?
    2) From the illustration of the model, the information from the second layer of the encoder will only flow to the second layer of the decoder. Is this understanding correct?
    3) Building off from 2), does the number of cells from each layer of the Encoder have to be equal to the number of cells from each corresponding layer of the Decoder?
    4) Does the number of layers in the decoder and the encoder have to be the same?
    I think my main problem is trying to visualise the model architecture and how the information flows if there are different numbers of cells/layers. For example, how would an encoder with 3 layers and 2 cells per layer connect to a decoder that perhaps has only 1 layer but 3 cells?

    • @statquest
      @statquest  1 year ago +1

      First, the important thing is that there are no rules in neural networks, just conventions. That said, in the original manuscript (and in pretty much every implementation), the number of LSTMs per layer and the number of layers are always equal in the Encoder and the Decoder - this makes it easy for the context vector to connect the two sets of LSTMs. However, if you want to come up with a different strategy, there are no rules that say you can't do it that way - you just have to figure out how to make it work.

  • @Rumit_Pathare
    @Rumit_Pathare 1 year ago +1

    you posted this video when I needed the most Thanks man and really awesome 👍🏻

  • @kadirkaandurmaz4391
    @kadirkaandurmaz4391 1 year ago +1

    Wow. Splendid!..

  • @jakemitchell6552
    @jakemitchell6552 1 year ago +2

    Please do a series on time series forecasting with fourier components (short-time fourier transform) and how to combine multiple frame-length stft outputs into a single inversion call (wavelets?)

    • @statquest
      @statquest  1 year ago +2

      I'll keep that in mind, but I might not be able to get to it soon.

  • @tupaiadhikari
    @tupaiadhikari 10 months ago +1

    Thank you Professor Josh, now I understand the working of Seq2Seq models completely. If possible, can you make a Python-based coding video in either Keras or PyTorch so that we can follow it completely through code? Thanks once again Professor Josh!

    • @statquest
      @statquest  10 months ago +2

      I'm working on the PyTorch Lightning videos right now.

    • @arshdeepkaur8842
      @arshdeepkaur8842 3 months ago +1

      Thanks@@statquest

  • @HAAH999
    @HAAH999 9 months ago

    When we connect the outputs from layer 1 to layer 2, do we connect both long/short memories or only the short term memory?

    • @statquest
      @statquest  9 months ago

      We connect the short term memories from one layer to the inputs of the next layer (which are different from the short term memories in the next layer).

  • @user-qd1sb6ho8l
    @user-qd1sb6ho8l 11 months ago

    Thank you, Josh. You are amazing.
    Would you please teach Graph Neural Networks?

    • @statquest
      @statquest  11 months ago

      I'll keep that in mind.

  • @amortalbeing
    @amortalbeing 6 months ago +1

    I liked it a lot. thanks ❤

  • @kmc1741
    @kmc1741 1 year ago +1

    I'm a student who studies in Korea. I love your videos and I appreciate that you made them. Can I ask when the video about 'Transformers' will be uploaded? It'll be a big help for me in studying NLP. Thank you.

    • @statquest
      @statquest  1 year ago

      I'm working on it right now, so it will, hopefully, be out sometime in June.

  • @benetramioicomas3785
    @benetramioicomas3785 5 months ago +1

    Hello! Awesome video as everything from this channel, but I have a question: how do you calculate the amount of weights and biases of both your network and the original one? If you could break down how you did it, it would be very useful! Thanks!

    • @statquest
      @statquest  5 months ago

      I'm not sure I understand your question. Are you asking how the weights and biases are trained?

    • @benetramioicomas3785
      @benetramioicomas3785 5 months ago

      No, in the video, at 15:48, you say that your model has 220 weights and biases. How do you calculate this number?

    • @statquest
      @statquest  5 months ago +1

      @@benetramioicomas3785 I wrote the model in PyTorch and then printed out all trainable parameters with a "for" loop that also counted the number of trainable parameters. Specifically, I wrote this loop to print out all of the weights and biases:
      for name, param in model.named_parameters():
          print(name, param.data)
      To count the number of weights and biases, I used this line:
      total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
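
As a hedged cross-check of the 220 figure, the count can also be worked out by hand. The sketch below assumes details that are not spelled out in this thread: a 4-token vocabulary on each side, separate 2-value embedding layers for the encoder and decoder, and the parameter layout PyTorch uses for `nn.LSTM` (four subunits per layer, each with input weights, hidden weights and two bias vectors). The helper names are illustrative.

```python
def lstm_params(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Trainable values in a stacked LSTM, following PyTorch's nn.LSTM layout."""
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden states as input.
        layer_input = input_size if layer == 0 else hidden_size
        # 4 subunits, each with input weights, hidden weights and two bias vectors.
        total += 4 * (
            layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
    return total


def seq2seq_params(
    in_vocab: int, out_vocab: int, embed_dim: int, hidden_size: int, num_layers: int
) -> int:
    """Embeddings + LSTMs for encoder and decoder, plus the output layer."""
    encoder = in_vocab * embed_dim + lstm_params(embed_dim, hidden_size, num_layers)
    decoder = out_vocab * embed_dim + lstm_params(embed_dim, hidden_size, num_layers)
    output_layer = hidden_size * out_vocab + out_vocab  # weights + biases
    return encoder + decoder + output_layer


# Assumed sizes: 4 input tokens, 4 output tokens, 2 embedding values,
# 2 cells per layer, 2 layers.
total = seq2seq_params(in_vocab=4, out_vocab=4, embed_dim=2, hidden_size=2, num_layers=2)
print(total)
```

Under these assumptions the breakdown is 8 + 96 for the encoder, 8 + 96 for the decoder and 12 for the output layer, which agrees with the 220 quoted above; the breakdown itself is an inference, not something confirmed in the reply.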

  • @spp626
    @spp626 1 year ago

    Hello Josh! I really like your videos and explanations. I have a doubt. Can we use the data of the last 150 years for stock price prediction (crude oil, etc.) in time series using GARCH? I have done the analysis with a GARCH model, but does that seem like too much data? Or should I use data of the last 50 or 60 years only? Could you please help me out? Thank you in advance.

    • @statquest
      @statquest  1 year ago

      Unfortunately I don't know much about GARCH.

    • @spp626
      @spp626 1 year ago +1

      @@statquest oh OK..thank you for your reply! 🤗

  • @hannahnelson4569
    @hannahnelson4569 4 days ago +1

    This is pretty cool!

  • @siddharthadevanv8256
    @siddharthadevanv8256 11 months ago

    Your videos are really amazing... ❤ Can you make a video on Boltzmann machines?

    • @statquest
      @statquest  11 months ago

      I'll keep that in mind.

  • @amrutumrankar4609
    @amrutumrankar4609 4 months ago

    In this full network, where are we telling it to convert the English word to the Spanish word? For example, in the LSTM, or in the neural network before the SoftMax function?

    • @statquest
      @statquest  4 months ago

      The whole thing does the job. There is no single part that does the translation.

  • @zhangeluo3947
    @zhangeluo3947 8 months ago

    Thank you so much sir for your clear explanation! But I have a question: if you do word embedding for all tokens in d (let's say >2) dimensions, does that mean we can use d LSTM cells rather than just 2 cells for each layer? Or even more layers, not just 2? Thank you!

    • @zhangeluo3947
      @zhangeluo3947 8 months ago +1

      Sorry, pardon my impatience, that's solved haha: 14:41

    • @statquest
      @statquest  8 months ago

      BAM! However, it's worth noting that an LSTM can also be configured to accept multiple inputs. So you could have a single LSTM layer that takes more than a single input.

  • @TonnyPodiyan
    @TonnyPodiyan 1 year ago +1

    Hello Sir, I was going through your stats videos (qq plot, distribution etc)and loved your content. I would be really grateful, if you can make something regarding a worm plot. Nothing comes up on youtube when I search it.

  • @magtazeum4071
    @magtazeum4071 1 year ago

    Hi Josh, is there a bundle of pdf books on statistics to purchase in your store? I already bought the study guide on linear regression.

    • @statquest
      @statquest  1 year ago +1

      You can buy my book, and it has (among other things) Gradient Descent, Decision Trees, Naive Bayes (and Gaussian Naive Bayes).

    • @magtazeum4071
      @magtazeum4071 1 year ago

      @@statquest ok . Thank you . Is it available to buy on your website?

    • @statquest
      @statquest  1 year ago +1

      @@magtazeum4071 Yep: statquest.org/statquest-store/

    • @magtazeum4071
      @magtazeum4071 1 year ago +1

      @@statquest Thank you very much Josh ❤

    • @magtazeum4071
      @magtazeum4071 1 year ago +1

      @@statquest Just purchased it ❤

  • @yangminqi839
    @yangminqi839 10 months ago

    Hi Josh! Your video is amazing! But I have one question:
    When building the Encoder, you mentioned that 2 LSTM cells and 2 LSTM layers are used. I think one LSTM layer has only 1 LSTM cell (in terms of PyTorch's nn.LSTM) if we don't unroll, doesn't it? So are two different LSTM neural networks (nn.LSTM) used, each with two layers and 1 LSTM cell per layer? Or is there just one LSTM neural network with 2 layers and 2 LSTM cells per layer (which would mean nn.LSTM can have multiple LSTM cells)? Which one is correct? I think it is the former, please correct me if I'm wrong!
    Many Thanks!!

    • @statquest
      @statquest  10 months ago +1

      For nn.LSTM(), the "num_layers" parameter determines how many layers you have, and the "hidden_size" parameter controls how many cells are in each layer. Due to how the math is done, it may seem that changing "hidden_size" just makes a larger or smaller cell, but it's the equivalent of changing the number of cells. So, when I coded this, I set "input_size=2", "hidden_size=2" and "num_layers=2". This is the equivalent of having 2 cells per layer and 2 layers.

    • @yangminqi839
      @yangminqi839 10 months ago

      @@statquest Thanks for your sincere reply! I think I have got your idea. You say that "hidden_size" parameter controls how many cells are in each layer, I think it's true under the situation that each cell generates a scalar output. But for Pytorch's nn.LSTMCell(input_size, output_size), only 1 nn.LSTMCell can transform the input of "input_size" to output of "output_size", which will involve some matrix multiplication not only scalar multiplication, isn't it? So even set "hidden_size=2" and "num_layers=2", I think the LSTM neural network has 2 layers and each layer have just 1 cell (nn.LSTMCell). Is my understanding right? Please correct me if I'm wrong.
      Thanks again!!!

    • @statquest
      @statquest  10 months ago

      @@yangminqi839 nn.LSTMCell() creates the equivalent of a stack of "cells" when you set hidden size > 1. This is "equivalent" because of how the math is implemented.

  • @user-dk3mk4il3g
    @user-dk3mk4il3g 6 months ago

    Hi sir, one question: can there be a case where the number of layers in the decoder is different from the encoder? Or can it never happen due to the size of the context vector? Would adding a new layer in the decoder give any advantage?

    • @statquest
      @statquest  6 months ago

      I don't know. It's possible that the context vector requires the number to be the same.

  • @szymonkaczmarski8477
    @szymonkaczmarski8477 1 year ago

    Great video! Finally some good explanation! I have a question regarding SOS and EOS tokens, sometimes it is mentioned that the decoder start the process of decoding by taking the SOS token, how does the whole picture differ then, for the both input sentences we always have then SOS and EOS tokens?

    • @statquest
      @statquest  1 year ago +1

      It really doesn't change anything since the embeddings and everything are learned based on what you use. If you use EOS to start things in the decoder, then the embeddings and weights in the decoder learn that EOS is what is used at the start. If you use SOS at the start in the decoder, then the embeddings and weights in the decoder learn that SOS is what is used. It really doesn't matter.

    • @szymonkaczmarski8477
      @szymonkaczmarski8477 1 year ago +1

      @@statquest thank you! cannot wait for the transformers video!

  • @user-if6ny5dk9z
    @user-if6ny5dk9z 5 months ago +1

    Thank You Sir...................

  • @Jai-tl3iq
    @Jai-tl3iq 4 months ago

    Sir, So in encoder-decoder architecture, will the number of LSTM units be the same as the number of words in a sequence? I mean, I've seen in many drawings, illustrations, they have three words and the same number of three LSTM cells?

    • @statquest
      @statquest  4 months ago

      No. We set the number of LSTMs in advance of getting any input and then just unroll them as many times as needed for different input lengths.
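
A minimal sketch of what "unrolling" means in this reply: a single LSTM cell with one fixed set of weights is applied once per input, so the same cell handles sequences of any length. The cell below is a scalar LSTM using the standard gate equations, with invented weights; the names `lstm_step` and `unroll` are illustrative, not from the video.

```python
import math
from typing import Dict, Sequence, Tuple


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(
    x: float, short: float, long: float, w: Dict[str, Tuple[float, float, float]]
) -> Tuple[float, float]:
    """One step of a scalar LSTM cell.

    `w` maps each subunit to (input weight, short-term weight, bias).
    """
    def pre(name: str) -> float:
        wi, ws, b = w[name]
        return wi * x + ws * short + b

    forget = sigmoid(pre("forget"))        # how much long-term memory to keep
    remember = sigmoid(pre("input"))       # how much of the candidate to store
    candidate = math.tanh(pre("candidate"))
    output = sigmoid(pre("output"))        # how much to pass on as short-term memory
    new_long = forget * long + remember * candidate
    new_short = output * math.tanh(new_long)
    return new_short, new_long


def unroll(inputs: Sequence[float], w: Dict[str, Tuple[float, float, float]]) -> Tuple[float, float]:
    """Apply the same cell, with the same weights, once per input value."""
    short, long = 0.0, 0.0
    for x in inputs:
        short, long = lstm_step(x, short, long, w)
    return short, long


weights = {  # made-up values; one set of weights reused at every step
    "forget": (1.0, 0.5, 0.0),
    "input": (0.8, -0.3, 0.1),
    "candidate": (1.2, 0.4, 0.0),
    "output": (0.6, 0.2, -0.1),
}

two_tokens = unroll([0.5, -1.0], weights)
three_tokens = unroll([0.5, -1.0, 0.3], weights)
print(two_tokens, three_tokens)
```

The number of cells is fixed when the model is built; only the number of times the loop runs changes with the length of the input.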

  • @slolom001
    @slolom001 3 months ago

    Awesome videos! I was wondering how people training larger models know "I'm ready to press train" on the big version? Because if some of their assumptions were wrong, they wasted all that time training. Is there some smaller version they can create to verify they're getting good results and they're ready to train the big one?

    • @statquest
      @statquest  3 months ago +1

      Usually you start with a smaller training dataset and see how it works first.

  • @scorinth
    @scorinth 1 year ago

    So, if I understand correctly, the context vector in this example has 8 dimensions?
    2 dimensions to the word embedding, times 2 since each layer outputs long and short term states, times two because there are two layers.
    So the context vector can be represented by 8 scalars...?

    • @statquest
      @statquest  1 year ago

      Each line that I drew for the "context vector" represents a single value, and there are 8 lines. The first layer of LSTMs has 2 LSTM cells, so it has 2 short-term memories and 2 long-term memories; 4 values total. The second layer of LSTMs also has 2 LSTM cells, so another 4 values. So there are 8 values in the context vector.
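
The arithmetic in this reply can be written as a one-line formula: every cell contributes one long-term and one short-term memory value. The function name below is hypothetical.

```python
def context_vector_size(num_layers: int, cells_per_layer: int) -> int:
    """Each cell contributes one long-term and one short-term memory value."""
    values_per_cell = 2  # cell state + hidden state
    return num_layers * cells_per_layer * values_per_cell


size = context_vector_size(num_layers=2, cells_per_layer=2)
print(size)
```

Note that the embedding size does not appear in the formula; it only affects how many input weights each cell has.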

  • @BHAVYAJAIN-lw1fo
    @BHAVYAJAIN-lw1fo 1 year ago +2

    Can't wait for the Transformers video

    • @statquest
      @statquest  1 year ago +1

      Me too. I'm working on it right now.

  • @dslkgjsdlkfjd
    @dslkgjsdlkfjd 6 months ago

    Do the LSTMs in the second layer have the same weights and biases as the LSTMs in the first layer? Sorry if I missed that part.

    • @statquest
      @statquest  6 months ago

      This question is answered at 8:48

    • @dslkgjsdlkfjd
      @dslkgjsdlkfjd 6 months ago

      Ahhh thank you, this clears up the LSTMs in the encoder and decoder. However, are the weights and biases in the 2 different LSTM cells in the encoder at layer 1 different from the weights and biases in the 2 different LSTM cells in the encoder at layer 2? Thank you for the amazing response time on my first message! @@statquest

    • @statquest
      @statquest  6 months ago

      @@dslkgjsdlkfjd This is answered at 6:41

    • @dslkgjsdlkfjd
      @dslkgjsdlkfjd 6 months ago +1

      BAM!!!@@statquest

  • @bfc7649
    @bfc7649 1 day ago +1

    Love your vids

  • @mitchynz
    @mitchynz 11 months ago

    Hi Josh - this one didn't really click for me. There's no 'aha' moment like the one I get with almost all your videos. I think we need to walk through the maths - or have a follow-up - even if it takes an hour. Perhaps a guest lecturer or willing student (happy to offer my time) ... alas, I guess the more complex the algorithms become, the less reasonable this is. However, you did a masterful job simplifying CNNs that I've never seen elsewhere, so I'm sure if anyone can do it, you can! Thanks regardless - there's a lot of joy in this community thanks to your teaching.

    • @statquest
      @statquest  11 months ago

      Yeah - it was a little bit of a bummer that I couldn't do the math all the way through. I'm working on something like that for Transformers and we'll see if I can pull it off. The math might have to be a separate video.

  • @bobuilder4444
    @bobuilder4444 2 months ago

    Do you need the same number of lstm cells as there are embedding values?

    • @statquest
      @statquest  2 months ago

      Technically no. If you have more embedding values, you can add weights to the connections to an LSTM unit and then sum those products to get the desired number of input values. If you have fewer embedding values, you can use extra weights to expand their number.

  • @user-km8ou2ml2d
    @user-km8ou2ml2d 7 months ago

    Is the matching of number of embeddings to number of LSTM cells per layer a coincidence or does each LSTM cell read/receive one of the embedding dimensions? (simple example had 2 -> 2, Seq2Seq paper had 1000 -> 1000)

    • @statquest
      @statquest  7 months ago

      It's just coincidence. We could have 10 embedding values and just 1 LSTM per layer.
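
One way to see why the two numbers are independent is to count the parameters in a layer. This is a rough sketch with a hypothetical helper. It assumes each gate in each cell has one weight per embedding value, one weight per short-term memory in the layer, and a single bias. Libraries may count differently; for example, torch.nn.LSTM keeps two bias vectors per gate:

```python
def lstm_layer_param_count(input_size, num_cells):
    """Count weights and biases in one layer of LSTM cells (one bias per gate assumed)."""
    gates = 4                                 # forget, input, candidate, output
    input_weights = num_cells * input_size    # one weight per embedding value, per cell
    recurrent_weights = num_cells * num_cells # one weight per short-term memory, per cell
    biases = num_cells
    return gates * (input_weights + recurrent_weights + biases)

# 2 embedding values feeding 2 LSTM cells, as in the video's example.
print(lstm_layer_param_count(input_size=2, num_cells=2))   # 40

# 10 embedding values feeding just 1 LSTM cell also works; only the weight count changes.
print(lstm_layer_param_count(input_size=10, num_cells=1))  # 48
```

Changing `input_size` only changes how many input weights each cell needs, not how many cells there are.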

  • @chrischauhan1649
    @chrischauhan1649 7 months ago

    I have one question. In the encoder, you say we have 2 layers of LSTMs with 2 LSTM cells in each layer. Why didn't we count the stacked LSTM cells (if we did, we would have 4 LSTM cells per layer)? Can you explain that? Also, considering PyTorch's torch.nn.LSTM(), here we would have input_size=2 and num_layers=2; would hidden_size be 2 or 4?

    • @statquest
      @statquest  7 months ago

      To be honest, I'm not sure I understand your first question. If you are asking why I decided to use 2 LSTMs per layer, and 2 layers, then the answer is that I thought that was the minimum that I could use to illustrate the concepts of how the layers work.
      For your second question, I set hidden_size=2. This creates two outputs per word per layer.

    • @chrischauhan1649
      @chrischauhan1649 7 months ago

      @@statquest sorry for the confusion, my question is regarding the additional LSTM cell you added to the stage at 6:08, that's what I mean by stacked LSTM cells (as they are shown as stack cells one on another) and that's what I was counting.

    • @statquest
      @statquest  7 months ago

      @@chrischauhan1649 Unfortunately I still don't understand your question. I've got 2 LSTMs per layer, and 2 layers, for a total of 4 LSTMs.

  • @datasciencepassions4522
    @datasciencepassions4522 1 year ago +1

    Awesome!

  • @harshmittal63
    @harshmittal63 3 months ago

    Hi Josh, I have a question at time stamp 11:54.
    Why are we feeding the EOS token to the decoder? Shouldn't we feed the SOS (start of sequence) token to initiate the translation?
    Thank you for sharing these world-class tutorials for free :)
    Cheers!

    • @statquest
      @statquest  3 months ago

      You can feed whatever you want into the decoder to get it initialized. I use EOS because that is what they used in the original manuscript. But we could have used SOS.

  • @dsagman
    @dsagman 7 months ago +1

    this is my homework assignment today. how did youtube know to put this in my feed? maybe the next statquest will explain. 😂

  • @shashankagarwal4047
    @shashankagarwal4047 10 months ago +1

    Thanks!

    • @statquest
      @statquest  10 months ago

      Hooray!!! Thank you so much for supporting StatQuest!!! TRIPLE BAM!!! :)

  • @falconer8518
    @falconer8518 1 year ago

    To train the model, is it enough to backpropagate through the decoder, or do you also need to update the weights in the encoder?

    • @statquest
      @statquest  1 year ago

      Remember the weights and biases start out as random numbers, so if you don't train the weights and biases in the Encoder, you might as well replace the LSTMs and the Embedding Layer with just a random set of weights. So, if you want to go through the trouble of adding LSTMs and embedding layers to the Encoder, you should train those weights and biases.

    • @falconer8518
      @falconer8518 1 year ago

      @@statquest How do I update the weights and biases in the encoder? Do I need to account for all inputs or update only the last one? I don't really understand how to implement this, because it scares me that the encoder doesn't have a part with a linear layer.

    • @statquest
      @statquest  1 year ago

      @@falconer8518 A forward step through the entire network begins with the encoder and ends after predicting two tokens, one for "vamos" and one for EOS. You then pass both of the predicted tokens and the known values (vamos and EOS) into the loss function. If this is confusing, don't worry, I'll make a video about how to do it as soon as I can. In the meantime, check out Ben's github tutorials: github.com/bentrevett/pytorch-seq2seq/tree/rewrite

    • @falconer8518
      @falconer8518 1 year ago +1

      @@statquest Thank you so much for your help, I think I'm starting to understand how to implement this. I also hope that you will make a video on backprop for the seq2seq model, since there is very little material on the Internet about this, and usually only one-to-many RNNs are covered.

  • @vishnuthanki8966
    @vishnuthanki8966 1 year ago

    Hey Josh....I have a request to make and this is little bit off-track but can you also provide help in understanding Cox Regression and Survival Analysis thing ??

  • @prashlovessamosa
    @prashlovessamosa 1 year ago +1

    Damm again awesome stuff.

  • @harshvardhankhanna7030
    @harshvardhankhanna7030 7 months ago

    I did not understand the use of multiple LSTM cells. It's like training two separate neural networks on the same problem, with the networks having no connection. How do two separate cells help each other learn better?
    Thanks in advance for the reply.

    • @statquest
      @statquest  7 months ago

      All any neural network does is fit a shape to your data (for details, see: ua-cam.com/video/CqOfi41LfDw/v-deo.html ). And the complexity of that shape is determined, in part, by the number of weights and biases that we have in our model as well as the number of non-linear activation functions we have. So the more weights we have, and the more activation functions we have, the more complicated a shape we can fit to more complicated data. In the video ( ua-cam.com/video/CqOfi41LfDw/v-deo.html ), I show how one activation function can make a bent line, but with 2, we can make a more complicated squiggle shape. Well, adding more LSTMs simply allows us to create an even more complicated shape to fit to our data.

  • @soukainafatimi7414
    @soukainafatimi7414 11 months ago

    Thank you for this video and for all the effort. I have one question: what is the difference between LSTM cells and LSTM units? What is the dimension of the hidden state and cell state in the example in the video?
    I am really confused.

    • @soukainafatimi7414
      @soukainafatimi7414 11 months ago

      And what about having multiple units in the LSTM cell? What would be the dimension of the context vector in this case?

    • @statquest
      @statquest  11 months ago

      "Cell" and "Unit" mean the same thing and refer to a single LSTM, with a single set of weights and biases. We can then stack "cells" and "units", so that each one receives the same input values, but has independent weights and biases. We can then have multiple layers of LSTMs, each with its own stack of LSTM cells. The layers allow the output of one layer of LSTMs to be used as input to the next layer of LSTMs.
      In the video, the dimension of the hidden state for the first layer is 2, since we have 2 LSTM cells. The dimension of the cell state for the first layer is also 2 for the same reason. Likewise, the hidden and cell states in the second layer both have 2 dimensions.

    • @soukainafatimi7414
      @soukainafatimi7414 11 months ago +1

      ​@@statquest thank you for answering , i really was in need to that answer.. by the way, i love this channel and the videos are uploaded just when i needed them (lucky me ) .. i was just learning about the attention network then BAM statQuest uploaded a video about them.
      and now (DOUBLE BAM)
      thank you again .

  • @theneumann7
    @theneumann7 1 year ago +1

    perfect as usual🦾

  • @anupmandal5396
    @anupmandal5396 6 months ago +1

    Awesome Video. Please make a video on GAN and BPTT. Request.....

    • @statquest
      @statquest  6 months ago +1

      I'll keep those topics in mind.

    • @anupmandal5396
      @anupmandal5396 6 months ago +1

      @@statquest Thank you sir.

  • @101alexmartin
    @101alexmartin 5 months ago +1

    Thanks for the video Josh, it’s very clearly explained.
    I have a technical question about the Decoder that I might have missed during the video. How can you dynamically change the sequence length fed to the Decoder? In other words, how can you unroll the decoder’s LSTMs? For instance, when you feed the EOS token to the (let’s say, already trained) Decoder, and then you get "vamos" and feed it together with the EOS token, the length of the input sequence to the decoder dynamically grows from 1 (EOS) to 2 (EOS + vamos). The architecture of the NN cannot change, so I’m unsure how to implement this.
    Cheers! 👍🏻👍🏻

    • @statquest
      @statquest  5 months ago +1

      When using the Encoder-Decoder for translation, you pass the tokens (or words) to the decoder one at a time. So we start by passing EOS to the decoder and it predicts "vamos". So then we pass "vamos" (not EOS + vamos) to the same decoder and repeat, passing one token to the decoder at a time until we get EOS.
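
The loop described above can be sketched as follows. The lookup table is a hypothetical stand-in for a trained decoder (a real one would run the LSTMs and a softmax); the point is only the control flow: start with EOS, feed each prediction back in as the next input, and stop when EOS is predicted.

```python
EOS = "EOS"

# Hypothetical stand-in for a trained decoder: token fed in -> token predicted.
TOY_PREDICTIONS = {"EOS": "vamos", "vamos": "EOS"}

def decoder_step(token, state):
    """Pretend decoder step: returns (predicted_token, new_state).

    A real decoder would update the long- and short-term memories in `state`;
    this toy version passes them through unchanged.
    """
    return TOY_PREDICTIONS[token], state

def translate(context_vector, max_length=10):
    state = context_vector      # memories handed over by the encoder
    token = EOS                 # the decoder is started with EOS
    output = []
    for _ in range(max_length): # guard against never predicting EOS
        token, state = decoder_step(token, state)
        output.append(token)
        if token == EOS:
            break
    return output

print(translate(context_vector=(0.0, 0.0)))  # ['vamos', 'EOS']
```

Only one token goes into the decoder per step, so the architecture never changes; the growing "sequence" lives in the loop and in the memories carried in `state`.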

    • @101alexmartin
      @101alexmartin 5 months ago

      @@statquest Thanks for the reply. I see your point. Do you then iterate on the whole Encoder-Decoder model or just on the Decoder? In other words, is the input to the model Let’s + go + EOS in the first iteration? Or do we just run the Encoder once to get the context vector and iterate over the Decoder, so that the input is just one word at a time (starting with EOS)? In this last case, I assume we have to update the cell and hidden states for each new word we input to the Decoder.

    • @statquest
      @statquest  5 months ago +1

      @@101alexmartin In this case, we have to calculate the values for input one word at a time, just like for the output - this is because the Long and Short Term memories have to be updated by each word sequentially. As you might imagine, this is a little bit of a computational bottleneck. And this bottleneck was one of the motivations for Transformers, which you can learn about here: ua-cam.com/video/zxQyTK8quyY/v-deo.html and here: ua-cam.com/video/bQ5BoolX9Ag/v-deo.html (NOTE: you might also want to watch this video on attention first: ua-cam.com/video/PSs6nxngL6k/v-deo.html )

    • @101alexmartin
      @101alexmartin 5 months ago

      @@statquest thanks for your reply. What do you mean by calculating the values for the input one word at a time? Do you mean that the input to the model in the first iteration would be [Let’s, go, EOS] and for the second iteration it would be [Let’s, go, vamos]? Or do you mean that you only use the Encoder once, to get the context vector output when you input [Let’s, go], and then you just focus on the Decoder, initializing it with the Encoder context vector in the first iteration, and then iterating over the Decoder (i.e over a LSTM architecture built for an input sequence length of 1), using the cell and hidden states of previous iterations to initialize the LSTM, until you get [EOS] as output?

    • @statquest
      @statquest  5 months ago +1

      @@101alexmartin What I mean is that we start by calculating the context vector (the long and short term memories) for "let's". Then we plug those values into the unrolled LSTMs that we use for "go", and keep doing that, calculating the context vector one word at a time, until we get to the end of the input. Watching the video on Transformers may help you understand the distinction that I'm making here between doing things sequentially vs. in parallel.

  • @marswang7111
    @marswang7111 1 year ago +1

    Love it

  • @pranaymandadapu9666
    @pranaymandadapu9666 6 months ago

    First of all, thank you so much for the clear explanation!
    I was confused when you said in the decoder during training that the next word we will give to the LSTM is not the predicted word, but we will use the word in training data. How will you let the network know whether the predicted token is correct?

    • @statquest
      @statquest  6 months ago

      I'm working on a video on how to code and train these networks that will help make this clear. In the mean time, know that we just compare all of the predicted output values to what we know should be the output values.
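
Until that video exists, here is a rough sketch of the comparison, not the author's code. The vocabulary order and the predicted probabilities are made up for illustration. During training the decoder is fed the known tokens (EOS, then "vamos"), and the loss measures how much probability each step assigned to the known next token:

```python
import math

# Hypothetical output vocabulary; the order is an assumption for this example.
VOCAB = ["ir", "vamos", "y", "EOS"]

def cross_entropy(probabilities, target):
    """Negative log of the probability assigned to the known token."""
    return -math.log(probabilities[VOCAB.index(target)])

known_output = ["vamos", "EOS"]

# Teacher forcing: the decoder inputs are the known tokens shifted by one,
# starting with EOS, regardless of what the decoder actually predicted.
decoder_inputs = ["EOS"] + known_output[:-1]

# Made-up softmax outputs, one row per decoder step.
predicted = [
    [0.1, 0.7, 0.1, 0.1],  # step 1: most probability on "vamos"
    [0.2, 0.2, 0.1, 0.5],  # step 2: most probability on "EOS"
]

loss = sum(cross_entropy(p, t) for p, t in zip(predicted, known_output)) / len(known_output)
print(decoder_inputs)  # ['EOS', 'vamos']
print(round(loss, 4))  # 0.5249
```

If a step puts little probability on the known token, its term in the loss is large, which is how the network "finds out" that the prediction was wrong.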

    • @pranaymandadapu9666
      @pranaymandadapu9666 6 months ago

      @@statquest thank you so much!

  • @wellingtonereh3423
    @wellingtonereh3423 9 months ago

    Thank you for the content. I have three questions:
    1) I've studied the bentrevett GitHub implementation and noticed that the size of the LSTM hidden layers is 512, but the input to the LSTM is 256 (the size of the embeddings). Shouldn't the hidden layer output from the LSTM be 256? I understood the layers, for example, when I printed the shapes:
    hidden shape: torch.Size([2, 1, 512])
    cell shape: torch.Size([2, 1, 512])
    I know the first dimension is 2 because the LSTM has 2 layers, but the number 512 confuses me.
    2) Are the cell states the long-term memory and the hidden states the short-term memory?
    3) How does batch size affect the model? If my batch size is 1, my sentence will be encoded into the context vector and decoded by the second LSTM. But if I pass 2 or more sentences, will my encoder handle it?

    • @statquest
      @statquest  9 months ago

      1) I'll be able to give you more details when I create my video on how to code LSTM seq2seq models
      2) Yes
      3) See the answer to #1.

    • @wellingtonereh3423
      @wellingtonereh3423 9 months ago +1

      @@statquest thank you very much!

  • @user-il8vc4pc5f
    @user-il8vc4pc5f 2 months ago

    Didn't understand the part that when we are using 2 LSTM cells per layer, Since the input to these states is the same and we are training it the same way why would the weight parameters be any different. Pls correct me if I'm wrong.

    • @statquest
      @statquest  2 months ago

      The parameters would be different because they started with different random initial values.
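
A small sketch of why the starting values matter. To keep it short, it uses two plain tanh units instead of two LSTM cells (a deliberate simplification), with made-up data and learning rate. Both units see the same input and are trained the same way; the only thing that differs between the two runs is the initialization:

```python
import math

def train(w, v, steps=50, lr=0.1, x=1.0, target=0.5):
    """Gradient descent on y = v[0]*tanh(w[0]*x) + v[1]*tanh(w[1]*x) with squared error."""
    w, v = list(w), list(v)
    for _ in range(steps):
        a = [math.tanh(wk * x) for wk in w]
        y = v[0] * a[0] + v[1] * a[1]
        err = 2.0 * (y - target)  # derivative of (y - target)^2 with respect to y
        grad_v = [err * a[k] for k in range(2)]
        grad_w = [err * v[k] * (1.0 - a[k] ** 2) * x for k in range(2)]
        v = [v[k] - lr * grad_v[k] for k in range(2)]
        w = [w[k] - lr * grad_w[k] for k in range(2)]
    return w, v

# Identical starting values: both units receive identical updates at every step.
same_w, same_v = train(w=[0.5, 0.5], v=[0.1, 0.1])
print(same_w[0] == same_w[1])  # the two units are still copies of each other

# Different starting values: the units receive different updates and stay different.
diff_w, diff_v = train(w=[0.5, -0.3], v=[0.1, 0.2])
print(diff_w[0] == diff_w[1])
```

With identical starting values the second unit adds nothing, which is why the weights are initialized randomly.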

    • @ishangarg2227
      @ishangarg2227 2 months ago +1

      Great thanks for the reply, means a lot.

  • @rrrprogram8667
    @rrrprogram8667 1 year ago +2

    Hey... Hope u r doing good.....
    So u are about to reach MEGA BAMMMMM

  • @TheApgreyd
    @TheApgreyd 1 year ago

    Is it possible to see more vids about vanilla pytorch?

    • @statquest
      @statquest  1 year ago

      I'll keep that in mind. However, vanilla PyTorch is much harder to code than PyTorch Lightning, especially when running the code in the cloud with multiple GPUs.

  • @uebyCyka
    @uebyCyka 1 year ago

    So, how do we train the encoder part? Or is it already pretrained? Like word2vec?

    • @statquest
      @statquest  1 year ago

      Everything, the whole encoder-decoder structure, is trained just like I describe at 12:58

    • @uebyCyka
      @uebyCyka 1 year ago

      @@statquest But how can the output for encoder be estimated?
      Thx for the answer

    • @statquest
      @statquest  1 year ago

      @@uebyCyka We can backpropagate (via The Chain Rule) from the output from the decoder through the context vector into the LSTMs in the encoder.
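
A rough numerical illustration of that dependence, not an implementation of backpropagation. It assumes one scalar LSTM unit in the encoder, one in the decoder, a 2-token output layer, and made-up weights and embedding values. Nudging a single encoder weight changes the loss measured at the decoder output, so the derivative that the chain rule would compute is not zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, p):
    """One step of a scalar LSTM unit; p maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b
    f, i, o = sigmoid(pre("forget")), sigmoid(pre("input")), sigmoid(pre("output"))
    g = math.tanh(pre("candidate"))
    c = f * c + i * g
    return o * math.tanh(c), c

def make_encoder(input_w_x):
    """Encoder unit whose input-gate weight is the one we will nudge (made-up values)."""
    return {"forget": (1.6, 2.7, 1.6), "input": (input_w_x, 2.0, 0.6),
            "candidate": (0.9, 1.4, -0.3), "output": (-0.2, 4.4, 0.6)}

DECODER = {"forget": (0.5, 1.0, 1.0), "input": (0.3, -0.8, 0.2),
           "candidate": (1.1, 0.7, 0.1), "output": (0.4, 1.2, 0.5)}
OUTPUT_WEIGHTS = [1.5, -1.5]  # index 0 scores "vamos", index 1 scores EOS

def loss(encoder_input_w_x):
    enc = make_encoder(encoder_input_w_x)
    h, c = 0.0, 0.0
    for x in [0.5, -1.0]:     # made-up embedding values for "let's" and "go"
        h, c = lstm_step(x, h, c, enc)
    # (h, c) is the context vector that initializes the decoder.
    h_dec, _ = lstm_step(1.0, h, c, DECODER)  # 1.0: made-up embedding value for EOS
    logits = [w * h_dec for w in OUTPUT_WEIGHTS]
    log_sum = math.log(sum(math.exp(z) for z in logits))
    return -(logits[0] - log_sum)  # cross entropy when the known token is "vamos"

eps = 1e-5
w = 1.7
gradient = (loss(w + eps) - loss(w - eps)) / (2 * eps)  # central finite difference
print(gradient)
```

Backpropagation obtains this kind of derivative analytically, for every encoder weight at once, instead of nudging weights one at a time; only the first decoder step is included here to keep the sketch short.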

    • @uebyCyka
      @uebyCyka 1 year ago

      @@statquest Oh. Thx again!

  • @Luxcium
    @Luxcium 11 months ago

    Oups 🙊 What is « *Seq2Seq* » I must go watch *Long Short Term-Memory* I think I will have to check out the quest also *Word Embedding and Word2Vec…* and then I will be happy to come back to learn with Josh 😅 I am impatient to learn *Attention for Neural Networks* _Clearly Explained_