Blowing up Transformer Decoder architecture

  • Published 4 Jun 2024
  • ABOUT ME
    ⭕ Subscribe: ua-cam.com/users/CodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [ 1 🔎] Blowing up the encoder architecture: • Blowing up the Transfo...
    [ 2 🔎] Code for building transformers from scratch: github.com/ajhalthor/Transfor...
    PLAYLISTS FROM MY CHANNEL
    ⭕ Transformers from scratch playlist: • Self Attention in Tran...
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know: • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 Introduction
    2:00 What is the Encoder doing?
    3:30 Text Processing
    5:05 Why are we batching data?
    6:03 Position Encoding
    6:34 Query, Key and Value Tensors
    7:57 Masked Multi Head Self Attention
    15:30 Residual Connections
    17:47 Multi Head Cross Attention
    21:25 Finishing up the Decoder Layer
    22:17 Training the Transformer
    24:33 Inference for the Transformer

COMMENTS • 55

  • @SarvaniChinthapalli
    @SarvaniChinthapalli A month ago +2

    Mind BLOWING... lucky enough to find your lectures!

  • @ahmadfaraz9279
    @ahmadfaraz9279 9 months ago +8

    I've been closely following the Transformer playlist, which has greatly helped in my comprehension of the Transformer Architecture. Your excellent work is evident, and I can truly appreciate the dedication you've shown in simplifying complex concepts. Your approach of deconstructing intricate ideas into manageable steps is truly praiseworthy. I also find it highly valuable how you begin each video with an overview of the entire architecture and contextualize the current steps within it. Your efforts are genuinely commendable, and I'm sincerely grateful for your contributions. Thank you.

  • @JoeChang1999
    @JoeChang1999 A year ago +3

    Your drawing skill is actually amazing!

  • @amiralioghli8622
    @amiralioghli8622 10 months ago +3

    Thank you for providing the video.
    Thank you for being you on UA-cam.
    I followed all of the tutorials.
    Your explanation, visualization and clear coding skills are wonderful.
    I also have a request: if possible, please create a tutorial on adapting time series data for Transformers - working with multiple datasets and classifying/forecasting them with a Transformer network. There are a lot of followers like me who could not find any clear video on this topic.

  • @galileo3431
    @galileo3431 A year ago +3

    Man you're a pure treasure! Keep up this outstanding work! 🙏🏼

  • @limbenny22
    @limbenny22 A year ago +2

    Truly amazing video. I have read the original paper, but this video definitely helped me understand it better, especially the way you visualize the whole architecture.

    • @CodeEmporium
      @CodeEmporium  A year ago

      Glad you saw it that way! More to come!

  • @user-jy4jz2qu8w
    @user-jy4jz2qu8w 15 days ago

    Thank you! Your video taught me a lot.

  • @MatheusHenrique-jz1dc
    @MatheusHenrique-jz1dc A year ago

    Excellent again, thank you!!

  • @lathashreeh5157
    @lathashreeh5157 A year ago +1

    Great video !!! Clear explanation about dimensions and the whole process.

  • @pierrelebreton7634
    @pierrelebreton7634 A year ago +1

    Thanks that is very clear!

  • @RanDuan-dp6oz
    @RanDuan-dp6oz A year ago

    Great work! It is really great that you can draw such a complex diagram. Can you share which software you're using to draw it?

  • @lakshman200
    @lakshman200 6 months ago

    This is Awesome!!!!
    thank you so much for the video!!!!!!

  • @jonfe
    @jonfe A year ago +3

    Can you explain, in another video, examples of the Q, K and V vectors? It is still confusing to me what they represent.

  • @sarahgh8756
    @sarahgh8756 3 months ago

    Thank you for all the videos about the transformer. Although I understood the architecture, I still don't know what to use as the decoder input (embedded target) and as the mask during the TEST phase.
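At inference time there is no ground-truth target to feed in: the decoder input typically starts with just a start token, and the causal mask is rebuilt over whatever has been generated so far. A minimal greedy-decoding sketch, assuming a hypothetical `model(src_tokens, tgt_tokens)` interface that returns one row of logits per target position (not the video's exact code):

```python
import torch

# Minimal greedy-decoding sketch (hypothetical interface, not the video's code).
# `model(src_tokens, tgt_tokens)` is assumed to return logits of shape
# (tgt_len, vocab_size); start_id / end_id are special token ids.
def greedy_decode(model, src_tokens, start_id, end_id, max_len=64):
    tgt = [start_id]                                   # decoder input begins with only <START>
    for _ in range(max_len):
        logits = model(src_tokens, torch.tensor(tgt))  # causal mask is rebuilt inside the model
        next_id = int(logits[-1].argmax())             # most likely next token
        tgt.append(next_id)
        if next_id == end_id:                          # stop once <END> is produced
            break
    return tgt[1:]                                     # generated token ids, without <START>
```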

  • @1nicolasdr
    @1nicolasdr A year ago +1

    Illustrating your explanations with code actually provides much deeper insight. Thanks, man! Quick note on this video: I was wondering why you didn't include the "output embeddings" in your sketch of the decoder?

    • @CodeEmporium
      @CodeEmporium  A year ago +1

      Thanks for commenting and the kind words. For the sketch, the point was to focus on the architecture. But I do hope all of the later videos in this series clear up what the output of the decoder looks like. Currently releasing the training code next week too. So hope that helps even more

  • @hajrawaheed9636
    @hajrawaheed9636 A month ago

    Great work indeed. It helped clear up a lot of things, especially the part where softmax is used for the decoder output. So the first row will output the first word of the target language. But in scenarios where two source words resonate with one target-language word, how does softmax handle that? Can you please help me figure this out?
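However many source words "resonate" with a target position, cross-attention has already mixed their value vectors into a single d_model-sized row before the output layer, so softmax is simply applied over the vocabulary once per target position. A small sketch of that output head, with assumed sizes (not the video's exact code):

```python
import torch
import torch.nn as nn

# Sketch of the decoder's output head (assumed sizes).
# However many source words a target position attends to, cross-attention has
# already mixed them into one d_model-sized vector; softmax just scores the vocabulary.
d_model, vocab_size, tgt_len = 512, 8000, 10
decoder_output = torch.randn(tgt_len, d_model)     # one row per target position
to_vocab = nn.Linear(d_model, vocab_size)
probs = torch.softmax(to_vocab(decoder_output), dim=-1)
print(probs.shape)      # (10, 8000): one distribution per target position
print(probs[0].sum())   # ≈ 1.0
```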

  • @andreabonvini
    @andreabonvini 5 months ago

    Excellent video @CodeEmporium 👏 One question: at minute 20:00 you say that we don’t need the look-ahead mask for the cross-attention layer at inference time, but this is valid during training too, right?
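That reading is correct: the look-ahead (causal) mask protects decoder self-attention during training as well, while cross-attention usually carries only a padding mask, because the decoder is allowed to see the whole source sentence. A small sketch of the two masks, assuming the common convention that -inf marks blocked positions:

```python
import torch

# Mask sketch (assumed convention: -inf marks positions to block before softmax).
# The look-ahead (causal) mask applies to decoder self-attention in BOTH training
# and inference; cross-attention normally needs only a padding mask.
tgt_len, src_len = 5, 7
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1)

src_pad = torch.tensor([0, 0, 0, 0, 0, 1, 1], dtype=torch.bool)  # last 2 source tokens are padding
cross_mask = torch.zeros(tgt_len, src_len).masked_fill(src_pad, float('-inf'))

print(causal_mask)  # upper triangle is -inf: no peeking at future target tokens
print(cross_mask)   # only padded source positions are blocked
```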

  • @tiffanyk2743
    @tiffanyk2743 A year ago +2

    Will you make a video on using a vision transformer + transformer decoder for image captioning?

    • @CodeEmporium
      @CodeEmporium  A year ago +1

      I shall get to this at some point. For the next series of videos, I am going to go through the history of language models so we truly understand why transformers and ChatGPT have the architectures they do. Once this series is complete, I will take this on or some other topics. Whatever videos come out, I am sure they will be helpful and fun :)

  • @user-jx2en8mo2b
    @user-jx2en8mo2b 2 months ago

    Dude, you resemble Ryan from The Office! Btw great explanation. Thanks for posting such wonderful content.

  • @user-ul2mw6fu2e
    @user-ul2mw6fu2e A year ago

    You are great

  • @creativeuser9086
    @creativeuser9086 A year ago

    By the way, when you do the dot product between q and K^T, it won't directly refer to cosine similarity (closer vectors mean more related terms) unless the magnitudes are normalized. But clearly, the vectors are not of the same magnitude, so how is the dot product a metric of similarity?
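Right: the raw dot product mixes angle and magnitude, so it is not cosine similarity. In practice the magnitudes are kept in check rather than normalized away: the logits are divided by sqrt(d_k), the surrounding layer norms keep activations in a reasonable range, and the learned projections can scale vectors so the dot product becomes a useful relevance score. A minimal sketch of scaled dot-product attention, with assumed shapes:

```python
import math
import torch

# Scaled dot-product attention as in "Attention Is All You Need" (shapes assumed).
# The raw dot product q·k grows with vector magnitude, not just angle; dividing by
# sqrt(d_k) keeps the logits in a range where softmax has useful gradients.
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., tgt_len, src_len)
    if mask is not None:
        scores = scores + mask                         # -inf entries are blocked
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                 # (..., tgt_len, d_v)

q = torch.randn(10, 64); k = torch.randn(7, 64); v = torch.randn(7, 64)
print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([10, 64])
```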

  • @fayezalhussein7115
    @fayezalhussein7115 A year ago

    Bravo, the best explanation I have ever seen. Bro, could you explain the implementation of CNN + Swin Transformer?

  • @kollivenkatamadhukar5059
    @kollivenkatamadhukar5059 7 months ago

    Ajay, can you provide a link to the architecture diagram that you are using for the explanation? It would be a great help.

  • @sandhyas2033
    @sandhyas2033 3 months ago

    Great work, Ajay. Can you share a link to the diagram that you showed in the video?

  • @sigurdenghoff2170
    @sigurdenghoff2170 A year ago

    When the data is fed through the network N times (21:45), does each pass through the network use the same weights or is a different set of weights used for each pass?

    • @learnaiwithjoelbunyan4764
      @learnaiwithjoelbunyan4764 11 months ago +1

      A different set of weights for each pass. This makes the model deeper and lets it capture more useful features.

    • @AshishBangwal
      @AshishBangwal 7 months ago +1

      To be precise, the data is not fed in fresh at each (vertical) layer: each layer takes its input from the previous layer (except the first one), and each layer has its own weights.
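In code, the N-layer stack is usually built by deep-copying one layer so every copy has independent weights; the target embeddings enter only the bottom layer, while each layer reuses the same encoder output for cross-attention. A sketch under those assumptions, where `layer` stands in for whatever decoder-layer class is used:

```python
import copy
import torch.nn as nn

# Sketch of stacking N decoder layers with independent weights.
# `layer` is an instance of whatever decoder-layer class is used; deepcopy
# gives every copy its own parameters.
class Decoder(nn.Module):
    def __init__(self, layer, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(copy.deepcopy(layer) for _ in range(num_layers))

    def forward(self, x, encoder_output, self_mask=None, cross_mask=None):
        # Target embeddings enter only the first layer; each later layer consumes
        # the previous layer's output, while every layer reads the same encoder
        # output through cross-attention.
        for layer in self.layers:
            x = layer(x, encoder_output, self_mask, cross_mask)
        return x
```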

  • @philipbutler
    @philipbutler A year ago +1

    7:00 I feel as though the implementations that just repeat the Q, K, V matrices are making a mistake, mostly because the purpose of multi-head attention is to learn different attentions, right? In the attention blocks the linear layers / learnable parameters come at the beginning for each of Q, K, V, then one big one after the heads are concatenated, so without the individual ones at the beginning (I'm assuming each is initialized to random values) I believe the multiple heads would be useless. Thoughts or corrections?

    • @philipbutler
      @philipbutler A year ago +1

      Ohhh, I just continued on to the part where they get divided by the number of heads. I thought each head worked with the whole matrices.

    • @philipbutler
      @philipbutler A year ago

      I’m more confused now but I think in a good way because I’m a bit closer to understanding

    • @philipbutler
      @philipbutler A year ago +1

      I'm even more confused because I'm realizing that in encoder-decoder attention, Q comes from the decoder and K, V come from the encoder, but I feel like it would make more sense for Q to come from the encoder and K and V to come from the decoder... because in the English-French example, it would be like asking "What is this English sentence in French?", then checking the compatibility of the English tokens with the French tokens, then multiplying these compatibilities by the French tokens for the output.
      Also, I still feel like dividing the tokens into pieces would be an unnecessary setback.
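On the head-splitting concern: it is not a setback, because a single learned d_model → d_model projection followed by a reshape into heads is equivalent to h independent d_model → 64 projections laid side by side, so every head still attends with its own learned slice of the representation. A small shape sketch with assumed sizes:

```python
import torch
import torch.nn as nn

# Shape sketch: one big projection + reshape is equivalent to h per-head
# projections, because the big weight matrix is just the per-head matrices
# stacked side by side (sizes assumed).
d_model, num_heads = 512, 8
head_dim = d_model // num_heads                      # 64

qkv_proj = nn.Linear(d_model, 3 * d_model)           # learnable; covers Q, K and V
x = torch.randn(10, d_model)                         # 10 tokens

q, k, v = qkv_proj(x).chunk(3, dim=-1)               # each (10, 512)
q = q.view(10, num_heads, head_dim).transpose(0, 1)  # (8, 10, 64): each head gets its own learned 64-dim slice
print(q.shape)
```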

  • @AbdulRahman-tj3wc
    @AbdulRahman-tj3wc 8 months ago +1

    While we have yet to translate the sentence into Kannada, how can we pass it to the decoder?

  • @supremachine
    @supremachine 5 months ago

    At the end of the decoder block, isn't there supposed to be another "Add & Norm" operation as in the architecture? Did he miss it?
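In the original paper there is indeed a third Add & Norm after the feed-forward sub-layer, so each decoder layer wraps all three sub-layers in a residual connection plus LayerNorm. A sketch of that post-norm flow, assuming sub-layer modules that take (query, key, value, mask):

```python
import torch.nn as nn

# Sketch of one decoder layer's flow as in the original paper (post-norm variant):
# each of the three sub-layers is wrapped in "x + sublayer(x)" followed by LayerNorm,
# including a final Add & Norm after the feed-forward block. The attention and
# feed-forward modules are assumed to exist with the signatures used below.
class DecoderLayer(nn.Module):
    def __init__(self, d_model, self_attn, cross_attn, feed_forward):
        super().__init__()
        self.self_attn, self.cross_attn, self.ffn = self_attn, cross_attn, feed_forward
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out, self_mask=None, cross_mask=None):
        x = self.norm1(x + self.self_attn(x, x, x, self_mask))               # masked self-attention
        x = self.norm2(x + self.cross_attn(x, enc_out, enc_out, cross_mask)) # cross-attention
        x = self.norm3(x + self.ffn(x))                                      # final Add & Norm after the FFN
        return x
```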

  • @user-om3qk3cn4l
    @user-om3qk3cn4l 11 months ago

    Multi-head cross attention will only work if the sequence lengths for the encoder and decoder are the same. But what if they aren't?

    • @jackwoo5288
      @jackwoo5288 10 months ago +1

      Just project q, k and v into a common space with some learnable matrices.

  • @jackwoo5288
    @jackwoo5288 10 months ago +1

    One thing I don't understand: at 20:35, the matrix obtained by multiplying the cross-attention matrix (derived from the encoder) with the v matrix is said to represent one English word per row. But the q part of the cross-attention comes from the Kannada sentence in the masked attention, so shouldn't each row of the resulting matrix correspond to a Kannada word?

    • @AshishBangwal
      @AshishBangwal 7 months ago

      The resulting matrix from (q^T · k) is just the attention matrix, and when it is multiplied with v we get the final representation, which is attentive to both the encoder output (English sentence) and the decoder output (Kannada sentence), hence the name cross-attention.

    • @jackwoo5288
      @jackwoo5288 7 months ago

      @@AshishBangwal I totally agree with your thought on where the name "cross-attention" comes from. Yet my point here is that since q is derived from the decoder, its dimension should be (max Kannada token length × 64). Then the number of rows of the matrix resulting from multiplying the attention matrix with v (derived from the encoder) ought to equal the max Kannada token length. Hence each row of this matrix should stand for a Kannada token instead of an English one.

    • @AshishBangwal
      @AshishBangwal 7 months ago

      @@jackwoo5288 Apologies for not getting your point the first time. 😅
      I am not sure if I'm correct, but I have written out what I understood below. PS: I wrote it with matrices, but I think it's similar for vectors.
      If I understood correctly, the output of the encoder will be (512 × Te), so the K (key matrix) and V (value matrix) for the cross-attention step will have the same dimension (512 × Te), and our decoder output after masked multi-head attention will be (512 × Td), so the Q (query matrix) will be (512 × Td).
      So if we go with the formula (Q^T · K) we get an attention matrix of shape (Td × Te), and multiplying that with V^T gives (Td × 512).
      It sounds more confusing when I read it again, lol. But I guess that's the fun part.
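The shapes in this thread are easy to check numerically, and they also answer the earlier question about unequal encoder/decoder lengths: the attention matrix is (Td × Te), so the output keeps one row per Kannada position even though every row is a weighted mix of English value vectors. A quick check with assumed sizes:

```python
import math
import torch

# Shape check for cross-attention with unequal sequence lengths (assumed sizes).
d_model, Te, Td = 512, 7, 10     # 7 English tokens, 10 Kannada tokens
k = torch.randn(Te, d_model)     # from the encoder output
v = torch.randn(Te, d_model)     # from the encoder output
q = torch.randn(Td, d_model)     # from the decoder's masked self-attention

attn = torch.softmax(q @ k.T / math.sqrt(d_model), dim=-1)
print(attn.shape)                # (10, 7): one row per Kannada position
out = attn @ v
print(out.shape)                 # (10, 512): still one row per Kannada position,
                                 # each row a weighted mix of English value vectors
```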

  • @astaragmohapatra9
    @astaragmohapatra9 A year ago

    This is really a great video, thanks man! Can you also share the PDFs of the diagram?

    • @CodeEmporium
      @CodeEmporium  A year ago

      Thanks so much. Yeah, I can't seem to export this as a diagram from the whiteboard software. So I am planning to sketch the combined encoder and decoder out together and make it accessible as a PDF. It should be done in the coming weeks.

    • @astaragmohapatra9
      @astaragmohapatra9 A year ago

      @@CodeEmporium That's great, this would be immensely helpful. Thanks again

    • @Mr.AIFella
      @Mr.AIFella A year ago

      @@CodeEmporium I can't wait for the diagrams to be uploaded; I really need them. I planned to draw them in PowerPoint, but I was limited by the size of a PowerPoint slide (it can't all fit in one place). Then I tried to copy what you drew in Explain Everything, but it took a lot of time looking back and forth at the video to see the diagram. So I decided to check whether you had uploaded them or not. If you plan to do so, please do at your earliest convenience, with my appreciation in advance for your tremendous work.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w A year ago

    Is it possible to get a PDF of the big diagram?

    • @CodeEmporium
      @CodeEmporium  A year ago

      I was trying to do exactly that. I might create a separate cleaner diagram and circulate that as a PDF with the complete transformer architecture

    • @Mr.AIFella
      @Mr.AIFella A year ago

      @@CodeEmporium May I know which software you used to design the diagram?
      Thanks in advance.

    • @CodeEmporium
      @CodeEmporium  A year ago +1

      It’s a white boarding tool called “Explain Everything”

  • @ShubhamAware18
    @ShubhamAware18 4 months ago

    Can you please share this image?

  • @ranam
    @ranam A year ago

    The only question I can ask you, brother, is this: by understanding these concepts, are we going to be the next Elon Musk and create GPT-4 or GPT-infinity? We are normal people who will use this technology, maybe to earn money, or to earn a PhD and then earn money. It is good to understand these things, just as we understand mechanics to understand the world around us; space science and quantum mechanics are areas only a few will venture into. I am not saying this is not important. Your content is very good and unique, but it mostly helps people working at an academic level or on a PhD. Your way of deriving neural networks and machine learning algorithms mathematically is great, but the matrix calculus material I tried was a bit hard for me; matrix calculus, which is so important, has hardly any computer algebra software to help with it. My only advice is that you are working very hard to make these videos, and you have a class of tutoring that should be taught at Oxford or Harvard. I am really proud to say you are an Indian, especially a South Indian. By this time you should have reached a million subscribers for your genuine thoughts and quality content, but people judge you differently. Please don't take this as criticism; keep going and you will be earning in the millions. I have told everyone about you, but not everyone has the knowledge to admire you; intellectual admiration requires not just a brain but an intellectual one. You are like the best book in the library that goes unnoticed; not all books reveal the big picture, and you have master-class tutoring techniques that I have only seen in costly, intellectual books. I don't know where you studied, or whether you are a teacher, lecturer or professor, but you will surely be rewarded for your selfless, pure-hearted efforts. In a world of fake gurus, the true ones never blow their own trumpet. Your content is equivalent to a PhD ❤

  • @user-mr3se3jk1r
    @user-mr3se3jk1r 2 days ago

    You have missed the concept of teacher forcing during training.
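For reference, teacher forcing means that during training the decoder is fed the ground-truth target sequence shifted right (with a start token prepended), while the loss compares each position's prediction against the next ground-truth token; the causal mask keeps position i from seeing token i. A minimal sketch with assumed special-token ids and a hypothetical `model` interface:

```python
import torch

# Teacher-forcing sketch (assumed token ids; `model` is a hypothetical seq2seq transformer).
START, END, PAD = 1, 2, 0
target = torch.tensor([5, 17, 42, 8, END])                        # ground-truth Kannada token ids

decoder_input = torch.cat([torch.tensor([START]), target[:-1]])   # <START> w1 w2 w3 w4
labels = target                                                   # w1 w2 w3 w4 <END>
print(decoder_input.tolist(), labels.tolist())

# During training (conceptually):
# logits = model(source_tokens, decoder_input)                    # (tgt_len, vocab_size)
# loss = torch.nn.functional.cross_entropy(logits, labels, ignore_index=PAD)
```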